In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

Business Cases for Data Science

Business Case 4 - ManyGiftsUK recommender system

Group AA

Members:

  • Emil Ahmadov (m20201004@novaims.unl.pt)
  • Doris Macean (m20200609@novaims.unl.pt)
  • Doyun Shin (m20200565@novaims.unl.pt)
  • Anastasiia Tagiltseva (m20200041@novaims.unl.pt)

1. Business Understanding

ManyGiftsUK asked us to:

  1. Explore the data and build models to address two problems:

    - Recommender system: the website homepage should offer a range of products each user might be interested in

    - Cold start: offer relevant products to new customers

  2. Implement adequate evaluation strategies and select an appropriate quality measure

  3. In the deployment phase, elaborate on the challenges of, and recommendations for, implementing the recommender system

Project Plan

Phase Time Resources Risks
Business Understanding 2 days All analysts Economic and market changes
Data Understanding 2 days All analysts Data problems, technological problems
Data Preparation 2 days Data scientists, DB engineers Data problems, technological problems
Modeling 4 days Data scientists Technological problems, inability to build adequate model
Evaluation 2 days All analysts Economic change, inability to implement results
Deployment 2 days Data scientists, DB engineers, implementation team Economic change, inability to implement results

2. Data Understanding

Metadata

Name Meaning
InvoiceNo Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation
StockCode Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product
Description Product (item) name. Nominal
Quantity The quantities of each product (item) per transaction. Numeric
InvoiceDate Invoice Date and time. Numeric, the day and time when each transaction was generated
UnitPrice Unit price. Numeric, Product price per unit in pounds
CustomerID Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer
Country Country name. Nominal, the name of the country where each customer resides
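The notebook relies on these conventions later, in particular the 'C' prefix on InvoiceNo that marks cancellations, and they can be checked with vectorized string operations. A minimal sketch, using made-up invoice numbers rather than the real file:

```python
import pandas as pd

# Hypothetical invoice numbers illustrating the metadata rules above
invoices = pd.Series(['536365', 'C537630', 'A563185'])

is_cancellation = invoices.str.startswith('C')  # 'C' prefix marks a cancellation
is_plain = invoices.str.fullmatch(r'\d{6}')     # a plain 6-digit transaction code

print(is_cancellation.tolist())  # [False, True, False]
print(is_plain.tolist())         # [True, False, False]
```

The same checks apply unchanged to the `InvoiceNo` column of the real dataset.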

2.1 Exploratory Data Analysis

In [2]:
# conda install implicit -c conda-forge -n root
In [3]:
# pip install implicit
In [4]:
import pandas as pd
import numpy as np
import implicit
from scipy import sparse
from scipy.sparse import coo_matrix
from implicit.als import AlternatingLeastSquares
from implicit.evaluation import ranking_metrics_at_k
from sklearn.decomposition import TruncatedSVD
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

from sklearn.metrics.pairwise import cosine_similarity
from implicit.evaluation import ranking_metrics_at_k, train_test_split

import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("ggplot")
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
In [5]:
#load the dataset
retail = pd.read_csv('retail.csv')
retail.head()
Out[5]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
In [6]:
#shape of our dataset
retail.shape
Out[6]:
(541909, 8)
In [7]:
# Correctly encode the variables
retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'])
retail = retail.astype({'CustomerID': object})
In [8]:
#exploring the unique values of each attribute
print("Number of transactions: ", retail['InvoiceNo'].nunique())
print("Number of products: ",retail['StockCode'].nunique())
print("Number of customers:", retail['CustomerID'].nunique() )
print("Percentage of customers NA (new): ", round(retail['CustomerID'].isnull().sum() * 100 / len(retail),2),"%" )
print('Number of countries: ',retail['Country'].nunique())
Number of transactions:  25900
Number of products:  4070
Number of customers: 4372
Percentage of customers NA (new):  24.93 %
Number of countries:  38
In [9]:
retail['StockCode'].value_counts().head(30).plot(kind='bar')
Out[9]:
<AxesSubplot:>
In [10]:
retail.groupby(['StockCode'])['Quantity'].sum().sort_values(ascending=False).head(30).plot(kind='bar')
Out[10]:
<AxesSubplot:xlabel='StockCode'>
In [11]:
# inspect product descriptions; note the non-product entries (e.g. 're-adjustment', 'water damaged') near the tail
retail['Description'].value_counts()
Out[11]:
WHITE HANGING HEART T-LIGHT HOLDER    2369
REGENCY CAKESTAND 3 TIER              2200
JUMBO BAG RED RETROSPOT               2159
PARTY BUNTING                         1727
LUNCH BAG RED RETROSPOT               1638
                                      ... 
GREEN WITH METAL BAG CHARM               1
re-adjustment                            1
ORANGE FELT VASE + FLOWERS               1
water damaged                            1
CAKESTAND, 3 TIER, LOVEHEART             1
Name: Description, Length: 4223, dtype: int64
In [12]:
# check for missing values
retail.isnull().sum()
Out[12]:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
In [13]:
# check for duplicates
retail.duplicated().sum()
Out[13]:
5268
In [14]:
#drop duplicated ones
retail.drop_duplicates(inplace=True)
In [15]:
retail.shape
Out[15]:
(536641, 8)

2.1.1. UnitPrice

In [16]:
sns.boxplot(x = retail['UnitPrice'])
Out[16]:
<AxesSubplot:xlabel='UnitPrice'>
In [17]:
# let's check the high absolute values
retail[abs(retail['UnitPrice']) > 3000]
Out[17]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
15016 C537630 AMAZONFEE AMAZON FEE -1 2010-12-07 15:04:00 13541.33 NaN United Kingdom
15017 537632 AMAZONFEE AMAZON FEE 1 2010-12-07 15:08:00 13541.33 NaN United Kingdom
16232 C537644 AMAZONFEE AMAZON FEE -1 2010-12-07 15:34:00 13474.79 NaN United Kingdom
16313 C537647 AMAZONFEE AMAZON FEE -1 2010-12-07 15:41:00 5519.25 NaN United Kingdom
16356 C537651 AMAZONFEE AMAZON FEE -1 2010-12-07 15:49:00 13541.33 NaN United Kingdom
16357 C537652 AMAZONFEE AMAZON FEE -1 2010-12-07 15:51:00 6706.71 NaN United Kingdom
43702 C540117 AMAZONFEE AMAZON FEE -1 2011-01-05 09:55:00 16888.02 NaN United Kingdom
43703 C540118 AMAZONFEE AMAZON FEE -1 2011-01-05 09:57:00 16453.71 NaN United Kingdom
96844 C544587 AMAZONFEE AMAZON FEE -1 2011-02-21 15:07:00 5575.28 NaN United Kingdom
96845 C544589 AMAZONFEE AMAZON FEE -1 2011-02-21 15:11:00 5258.77 NaN United Kingdom
124741 C546987 AMAZONFEE AMAZON FEE -1 2011-03-18 12:56:00 5693.05 NaN United Kingdom
124787 C546989 AMAZONFEE AMAZON FEE -1 2011-03-18 12:59:00 5225.03 NaN United Kingdom
173277 C551685 POST POSTAGE -1 2011-05-03 12:51:00 8142.75 16029.0 United Kingdom
173382 551697 POST POSTAGE 1 2011-05-03 13:46:00 8142.75 16029.0 United Kingdom
173391 C551699 M Manual -1 2011-05-03 14:12:00 6930.00 16029.0 United Kingdom
191385 C553354 AMAZONFEE AMAZON FEE -1 2011-05-16 13:54:00 5876.40 NaN United Kingdom
191386 C553355 AMAZONFEE AMAZON FEE -1 2011-05-16 13:58:00 7006.83 NaN United Kingdom
222681 C556445 M Manual -1 2011-06-10 15:31:00 38970.00 15098.0 United Kingdom
239250 C558036 AMAZONFEE AMAZON FEE -1 2011-06-24 12:31:00 5791.18 NaN United Kingdom
239251 C558037 AMAZONFEE AMAZON FEE -1 2011-06-24 12:33:00 4534.24 NaN United Kingdom
262413 C559915 AMAZONFEE AMAZON FEE -1 2011-07-13 15:18:00 4383.62 NaN United Kingdom
262414 C559917 AMAZONFEE AMAZON FEE -1 2011-07-13 15:21:00 6497.47 NaN United Kingdom
268027 C560372 M Manual -1 2011-07-18 12:26:00 4287.63 17448.0 United Kingdom
268028 560373 M Manual 1 2011-07-18 12:30:00 4287.63 NaN United Kingdom
271151 C560647 M Manual -1 2011-07-20 11:31:00 3060.60 18102.0 United Kingdom
287103 C562062 AMAZONFEE AMAZON FEE -1 2011-08-02 12:17:00 4575.64 NaN United Kingdom
287150 C562086 AMAZONFEE AMAZON FEE -1 2011-08-02 12:27:00 6721.37 NaN United Kingdom
293842 C562647 M Manual -1 2011-08-08 12:56:00 3155.95 15502.0 United Kingdom
297723 562955 DOT DOTCOM POSTAGE 1 2011-08-11 10:14:00 4505.17 NaN United Kingdom
299982 A563185 B Adjust bad debt 1 2011-08-12 14:50:00 11062.06 NaN United Kingdom
299983 A563186 B Adjust bad debt 1 2011-08-12 14:51:00 -11062.06 NaN United Kingdom
299984 A563187 B Adjust bad debt 1 2011-08-12 14:52:00 -11062.06 NaN United Kingdom
312092 C564340 AMAZONFEE AMAZON FEE -1 2011-08-24 14:50:00 4527.65 NaN United Kingdom
312246 C564341 AMAZONFEE AMAZON FEE -1 2011-08-24 14:53:00 6662.51 NaN United Kingdom
342611 C566889 AMAZONFEE AMAZON FEE -1 2011-09-15 13:50:00 5522.14 NaN United Kingdom
342635 C566899 AMAZONFEE AMAZON FEE -1 2011-09-15 13:53:00 7427.97 NaN United Kingdom
374542 569382 M Manual 1 2011-10-03 16:44:00 3155.95 15502.0 United Kingdom
383495 C570025 AMAZONFEE AMAZON FEE -1 2011-10-07 10:29:00 5942.57 NaN United Kingdom
406404 C571750 M Manual -1 2011-10-19 11:16:00 3949.32 12744.0 Singapore
406406 571751 M Manual 1 2011-10-19 11:18:00 3949.32 12744.0 Singapore
422351 573077 M Manual 1 2011-10-27 14:13:00 4161.06 12536.0 France
422375 C573079 M Manual -2 2011-10-27 14:15:00 4161.06 12536.0 France
422376 573080 M Manual 1 2011-10-27 14:20:00 4161.06 12536.0 France
429248 C573549 AMAZONFEE AMAZON FEE -1 2011-10-31 13:23:00 5942.57 NaN United Kingdom
446434 C574897 AMAZONFEE AMAZON FEE -1 2011-11-07 15:03:00 5877.18 NaN United Kingdom
446533 C574902 AMAZONFEE AMAZON FEE -1 2011-11-07 15:21:00 8286.22 NaN United Kingdom
524601 C580604 AMAZONFEE AMAZON FEE -1 2011-12-05 11:35:00 11586.50 NaN United Kingdom
524602 C580605 AMAZONFEE AMAZON FEE -1 2011-12-05 11:36:00 17836.46 NaN United Kingdom
In [18]:
# many of these rows are Amazon fees, postage or manual adjustments rather than product sales; let's take a closer look at the Amazon fees:
retail[retail['Description']=='AMAZON FEE']
Out[18]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
14514 C537600 AMAZONFEE AMAZON FEE -1 2010-12-07 12:41:00 1.00 NaN United Kingdom
15016 C537630 AMAZONFEE AMAZON FEE -1 2010-12-07 15:04:00 13541.33 NaN United Kingdom
15017 537632 AMAZONFEE AMAZON FEE 1 2010-12-07 15:08:00 13541.33 NaN United Kingdom
16232 C537644 AMAZONFEE AMAZON FEE -1 2010-12-07 15:34:00 13474.79 NaN United Kingdom
16313 C537647 AMAZONFEE AMAZON FEE -1 2010-12-07 15:41:00 5519.25 NaN United Kingdom
16356 C537651 AMAZONFEE AMAZON FEE -1 2010-12-07 15:49:00 13541.33 NaN United Kingdom
16357 C537652 AMAZONFEE AMAZON FEE -1 2010-12-07 15:51:00 6706.71 NaN United Kingdom
43702 C540117 AMAZONFEE AMAZON FEE -1 2011-01-05 09:55:00 16888.02 NaN United Kingdom
43703 C540118 AMAZONFEE AMAZON FEE -1 2011-01-05 09:57:00 16453.71 NaN United Kingdom
96844 C544587 AMAZONFEE AMAZON FEE -1 2011-02-21 15:07:00 5575.28 NaN United Kingdom
96845 C544589 AMAZONFEE AMAZON FEE -1 2011-02-21 15:11:00 5258.77 NaN United Kingdom
124741 C546987 AMAZONFEE AMAZON FEE -1 2011-03-18 12:56:00 5693.05 NaN United Kingdom
124787 C546989 AMAZONFEE AMAZON FEE -1 2011-03-18 12:59:00 5225.03 NaN United Kingdom
135534 547901 AMAZONFEE AMAZON FEE 1 2011-03-28 11:57:00 219.76 NaN United Kingdom
135590 C547904 AMAZONFEE AMAZON FEE -1 2011-03-28 12:02:00 219.76 NaN United Kingdom
191385 C553354 AMAZONFEE AMAZON FEE -1 2011-05-16 13:54:00 5876.40 NaN United Kingdom
191386 C553355 AMAZONFEE AMAZON FEE -1 2011-05-16 13:58:00 7006.83 NaN United Kingdom
239250 C558036 AMAZONFEE AMAZON FEE -1 2011-06-24 12:31:00 5791.18 NaN United Kingdom
239251 C558037 AMAZONFEE AMAZON FEE -1 2011-06-24 12:33:00 4534.24 NaN United Kingdom
262413 C559915 AMAZONFEE AMAZON FEE -1 2011-07-13 15:18:00 4383.62 NaN United Kingdom
262414 C559917 AMAZONFEE AMAZON FEE -1 2011-07-13 15:21:00 6497.47 NaN United Kingdom
287103 C562062 AMAZONFEE AMAZON FEE -1 2011-08-02 12:17:00 4575.64 NaN United Kingdom
287150 C562086 AMAZONFEE AMAZON FEE -1 2011-08-02 12:27:00 6721.37 NaN United Kingdom
312092 C564340 AMAZONFEE AMAZON FEE -1 2011-08-24 14:50:00 4527.65 NaN United Kingdom
312246 C564341 AMAZONFEE AMAZON FEE -1 2011-08-24 14:53:00 6662.51 NaN United Kingdom
342611 C566889 AMAZONFEE AMAZON FEE -1 2011-09-15 13:50:00 5522.14 NaN United Kingdom
342635 C566899 AMAZONFEE AMAZON FEE -1 2011-09-15 13:53:00 7427.97 NaN United Kingdom
383495 C570025 AMAZONFEE AMAZON FEE -1 2011-10-07 10:29:00 5942.57 NaN United Kingdom
429248 C573549 AMAZONFEE AMAZON FEE -1 2011-10-31 13:23:00 5942.57 NaN United Kingdom
429249 C573550 AMAZONFEE AMAZON FEE -1 2011-10-31 13:32:00 2185.04 NaN United Kingdom
446434 C574897 AMAZONFEE AMAZON FEE -1 2011-11-07 15:03:00 5877.18 NaN United Kingdom
446533 C574902 AMAZONFEE AMAZON FEE -1 2011-11-07 15:21:00 8286.22 NaN United Kingdom
524601 C580604 AMAZONFEE AMAZON FEE -1 2011-12-05 11:35:00 11586.50 NaN United Kingdom
524602 C580605 AMAZONFEE AMAZON FEE -1 2011-12-05 11:36:00 17836.46 NaN United Kingdom
In [19]:
#lets drop fees paid to Amazon
retail.drop(retail[retail['Description']=='AMAZON FEE'].index, axis=0, inplace=True)
In [20]:
retail[retail['Description'].str.contains('POSTAGE', na=False)]['Description'].value_counts()
Out[20]:
POSTAGE           1252
DOTCOM POSTAGE     709
Name: Description, dtype: int64
In [21]:
retail[retail['Description'].str.contains('POSTAGE', na=False)]
Out[21]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
45 536370 POST POSTAGE 3 2010-12-01 08:45:00 18.00 12583.0 France
386 536403 POST POSTAGE 1 2010-12-01 11:27:00 15.00 12791.0 Netherlands
1123 536527 POST POSTAGE 1 2010-12-01 13:04:00 18.00 12662.0 Germany
1814 536544 DOT DOTCOM POSTAGE 1 2010-12-01 14:32:00 569.77 NaN United Kingdom
3041 536592 DOT DOTCOM POSTAGE 1 2010-12-01 17:06:00 607.49 NaN United Kingdom
... ... ... ... ... ... ... ... ...
541216 581494 POST POSTAGE 2 2011-12-09 10:13:00 18.00 12518.0 Germany
541540 581498 DOT DOTCOM POSTAGE 1 2011-12-09 10:26:00 1714.17 NaN United Kingdom
541730 581570 POST POSTAGE 1 2011-12-09 11:59:00 18.00 12662.0 Germany
541767 581574 POST POSTAGE 2 2011-12-09 12:09:00 18.00 12526.0 Germany
541768 581578 POST POSTAGE 3 2011-12-09 12:16:00 18.00 12713.0 Germany

1961 rows × 8 columns

In [22]:
# descriptions containing 'POSTAGE' are not products, and the same description appears with different unit prices
# for the purpose of this analysis we can drop them
retail.drop(retail[retail['Description'].str.contains('POSTAGE', na=False)].index, axis=0, inplace=True)
In [23]:
# let's check whether there are still negative unit prices:
retail[retail['UnitPrice'] < 0]
Out[23]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
299983 A563186 B Adjust bad debt 1 2011-08-12 14:51:00 -11062.06 NaN United Kingdom
299984 A563187 B Adjust bad debt 1 2011-08-12 14:52:00 -11062.06 NaN United Kingdom
In [24]:
# they don't provide any information about products, so we can drop them
retail.drop(retail[retail['UnitPrice'] < 0].index, axis=0, inplace=True)
In [25]:
# let's check the high values once again:
retail[abs(retail['UnitPrice']) > 3000]
Out[25]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
173391 C551699 M Manual -1 2011-05-03 14:12:00 6930.00 16029.0 United Kingdom
222681 C556445 M Manual -1 2011-06-10 15:31:00 38970.00 15098.0 United Kingdom
268027 C560372 M Manual -1 2011-07-18 12:26:00 4287.63 17448.0 United Kingdom
268028 560373 M Manual 1 2011-07-18 12:30:00 4287.63 NaN United Kingdom
271151 C560647 M Manual -1 2011-07-20 11:31:00 3060.60 18102.0 United Kingdom
293842 C562647 M Manual -1 2011-08-08 12:56:00 3155.95 15502.0 United Kingdom
299982 A563185 B Adjust bad debt 1 2011-08-12 14:50:00 11062.06 NaN United Kingdom
374542 569382 M Manual 1 2011-10-03 16:44:00 3155.95 15502.0 United Kingdom
406404 C571750 M Manual -1 2011-10-19 11:16:00 3949.32 12744.0 Singapore
406406 571751 M Manual 1 2011-10-19 11:18:00 3949.32 12744.0 Singapore
422351 573077 M Manual 1 2011-10-27 14:13:00 4161.06 12536.0 France
422375 C573079 M Manual -2 2011-10-27 14:15:00 4161.06 12536.0 France
422376 573080 M Manual 1 2011-10-27 14:20:00 4161.06 12536.0 France
In [26]:
# they don't provide information about products sold, so we can drop them too:
retail.drop(retail[abs(retail['UnitPrice']) > 3000].index, axis=0,inplace=True)
In [27]:
sns.boxplot(x = retail['UnitPrice'])
Out[27]:
<AxesSubplot:xlabel='UnitPrice'>
In [28]:
retail[(retail['UnitPrice']==0) & (retail['Description'].isna()==False)]
Out[28]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
6391 536941 22734 amazon 20 2010-12-03 12:08:00 0.0 NaN United Kingdom
6392 536942 22139 amazon 15 2010-12-03 12:08:00 0.0 NaN United Kingdom
7313 537032 21275 ? -30 2010-12-03 16:50:00 0.0 NaN United Kingdom
9302 537197 22841 ROUND CAKE TIN VINTAGE GREEN 1 2010-12-05 14:02:00 0.0 12647.0 Germany
13217 537425 84968F check -20 2010-12-06 15:35:00 0.0 NaN United Kingdom
... ... ... ... ... ... ... ... ...
535336 581213 22576 check -30 2011-12-07 18:38:00 0.0 NaN United Kingdom
536908 581226 23090 missing -338 2011-12-08 09:56:00 0.0 NaN United Kingdom
538504 581406 46000M POLYESTER FILLER PAD 45x45cm 240 2011-12-08 13:58:00 0.0 NaN United Kingdom
538505 581406 46000S POLYESTER FILLER PAD 40x40cm 300 2011-12-08 13:58:00 0.0 NaN United Kingdom
538919 581422 23169 smashed -235 2011-12-08 15:24:00 0.0 NaN United Kingdom

1054 rows × 8 columns

2.1.2. Quantity

In [29]:
sns.boxplot(x = retail['Quantity'])
Out[29]:
<AxesSubplot:xlabel='Quantity'>
In [30]:
# all negative quantities have a 'C' prefix in their invoice number; we take a closer look at them in the next section
retail[retail['Quantity']<0]
Out[30]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
141 C536379 D Discount -1 2010-12-01 09:41:00 27.50 14527.0 United Kingdom
154 C536383 35004C SET OF 3 COLOURED FLYING DUCKS -1 2010-12-01 09:49:00 4.65 15311.0 United Kingdom
235 C536391 22556 PLASTERS IN TIN CIRCUS PARADE -12 2010-12-01 10:24:00 1.65 17548.0 United Kingdom
236 C536391 21984 PACK OF 12 PINK PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0 United Kingdom
237 C536391 21983 PACK OF 12 BLUE PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0 United Kingdom
... ... ... ... ... ... ... ... ...
540449 C581490 23144 ZINC T-LIGHT HOLDER STARS SMALL -11 2011-12-09 09:57:00 0.83 14397.0 United Kingdom
541541 C581499 M Manual -1 2011-12-09 10:28:00 224.69 15498.0 United Kingdom
541715 C581568 21258 VICTORIAN SEWING BOX LARGE -5 2011-12-09 11:57:00 10.95 15311.0 United Kingdom
541716 C581569 84978 HANGING HEART JAR T-LIGHT HOLDER -1 2011-12-09 11:58:00 1.25 17315.0 United Kingdom
541717 C581569 20979 36 PENCILS TUBE RED RETROSPOT -5 2011-12-09 11:58:00 1.25 17315.0 United Kingdom

10421 rows × 8 columns

In [31]:
retail.isna().sum()
Out[31]:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     134250
Country             0
dtype: int64

2.1.3. Descriptions

In [32]:
# while examining the descriptions above, we found that some are written in lower case and are not product names
# let's collect them now
low_case=[]
for i in retail['Description'].unique():
    if str(i).islower():
        low_case.append(i)
low_case.remove(np.nan)
In [33]:
low_case
Out[33]:
['amazon',
 'check',
 'damages',
 'faulty',
 'amazon sales',
 'reverse 21/5/10 adjustment',
 'mouldy, thrown away.',
 'found',
 'counted',
 'label mix up',
 'samples/damages',
 'thrown away',
 'incorrectly made-thrown away.',
 'showroom',
 'wrongly sold as sets',
 'dotcom sold sets',
 'wrongly sold sets',
 '? sold as sets?',
 '?sold as sets?',
 'damages/display',
 'damaged stock',
 'broken',
 'throw away',
 'wrong barcode (22467)',
 'wrongly sold (22719) barcode',
 'wrong barcode',
 'barcode problem',
 '?lost',
 "thrown away-can't sell.",
 "thrown away-can't sell",
 'rcvd be air temp fix for dotcom sit',
 'damages?',
 're dotcom quick fix.',
 'sold in set?',
 'cracked',
 'sold as 22467',
 'damaged',
 'did  a credit  and did not tick ret',
 'adjustment',
 'returned',
 'wrong code?',
 'wrong code',
 'adjust',
 'crushed',
 'damages/showroom etc',
 'samples',
 'mailout ',
 'mailout',
 'sold as set/6 by dotcom',
 'wet/rusty',
 'damages/dotcom?',
 'on cargo order',
 'smashed',
 'reverse previous adjustment',
 'wet damaged',
 'missing',
 'sold as set on dotcom',
 'sold as set on dotcom and amazon',
 'water damage',
 'sold as set by dotcom',
 'printing smudges/thrown away',
 'to push order througha s stock was ',
 'found some more on shelf',
 'mix up with c',
 'mouldy, unsaleable.',
 'wrongly marked. 23343 in box',
 'came coded as 20713',
 'alan hodge cant mamage this section',
 'dotcom',
 'stock creditted wrongly',
 'ebay',
 'incorrectly put back into stock',
 'taig adjust no stock',
 'code mix up? 84930',
 '?display?',
 'sold as 1',
 '?missing',
 'crushed ctn',
 'test',
 'temp adjustment',
 'taig adjust',
 'allocate stock for dotcom orders ta',
 'add stock to allocate online orders',
 'for online retail orders',
 'found box',
 'website fixed',
 'historic computer difference?....se',
 'incorrect stock entry.',
 'michel oops',
 'wrongly coded 20713',
 'wrongly coded-23343',
 'stock check',
 'crushed boxes',
 "can't find",
 'mouldy',
 'wrongly marked 23343',
 '20713 wrongly marked',
 're-adjustment',
 'wrongly coded 23343',
 'wrongly marked',
 'dotcom sales',
 'had been put aside',
 'damages wax',
 'water damaged',
 'wrongly marked carton 22804',
 'missing?',
 'wet rusty',
 'amazon adjust',
 '???lost',
 'dotcomstock',
 'sold with wrong barcode',
 'dotcom adjust',
 'rusty thrown away',
 'rusty throw away',
 'check?',
 '?? missing',
 'wet pallet',
 '????missing',
 '???missing',
 'lost in space',
 'wet?',
 'lost??',
 'wet',
 'wet boxes',
 '????damages????',
 'mixed up',
 'lost']
In [34]:
#they are mostly problematic descriptions
retail[retail['Description'].isin(low_case)]
Out[34]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
6391 536941 22734 amazon 20 2010-12-03 12:08:00 0.0 NaN United Kingdom
6392 536942 22139 amazon 15 2010-12-03 12:08:00 0.0 NaN United Kingdom
13217 537425 84968F check -20 2010-12-06 15:35:00 0.0 NaN United Kingdom
13218 537426 84968E check -35 2010-12-06 15:36:00 0.0 NaN United Kingdom
13264 537432 35833G damages -43 2010-12-06 16:10:00 0.0 NaN United Kingdom
... ... ... ... ... ... ... ... ...
535334 581211 22142 check 14 2011-12-07 18:36:00 0.0 NaN United Kingdom
535335 581212 22578 lost -1050 2011-12-07 18:38:00 0.0 NaN United Kingdom
535336 581213 22576 check -30 2011-12-07 18:38:00 0.0 NaN United Kingdom
536908 581226 23090 missing -338 2011-12-08 09:56:00 0.0 NaN United Kingdom
538919 581422 23169 smashed -235 2011-12-08 15:24:00 0.0 NaN United Kingdom

493 rows × 8 columns

In [35]:
# all of them have a unit price of zero and there are only 493 rows, so we can drop them
retail[retail['Description'].isin(low_case)]['UnitPrice'].value_counts()
Out[35]:
0.0    493
Name: UnitPrice, dtype: int64
In [36]:
# drop all rows with lower-case descriptions
retail.drop(retail[retail['Description'].isin(low_case)].index, axis=0, inplace=True)
In [37]:
up_case=[]
for i in retail['Description'].unique():
    if str(i).isupper():
        up_case.append(i)
In [38]:
# check the descriptions that are neither fully upper case nor fully lower case (mixed case)
mixed=[]
for i in retail[retail['Description'].isin(up_case) == False]['Description'].unique():
    # some descriptions are all upper case except for a weight written with a lower-case 'g'; to skip those, check only the first two characters
    if str(i)[0:2].isupper() == False:
        mixed.append(i)
In [39]:
# exclude the NaNs
mixed.remove(np.nan)
mixed
Out[39]:
['Discount',
 'Manual',
 "Dr. Jam's Arouzer Stress Ball",
 '3 TRADITIONAl BISCUIT CUTTERS  SET',
 'Bank Charges',
 '?',
 "Dad's Cab Electronic Meter",
 'Dotcom sales',
 'Dotcomgiftshop Gift Voucher £40.00',
 'Found',
 'Dotcomgiftshop Gift Voucher £50.00',
 'Dotcomgiftshop Gift Voucher £30.00',
 'Dotcomgiftshop Gift Voucher £20.00',
 'Given away',
 'Dotcom',
 'Adjustment',
 'Dotcomgiftshop Gift Voucher £10.00',
 'Dotcom set',
 'Amazon sold sets',
 'Thrown away.',
 "Dotcom sold in 6's",
 'Damaged',
 'mystery! Only ever imported 1800',
 'Display',
 'Missing',
 'damages/credits from ASOS.',
 'Not rcvd in 10/11/2010 delivery',
 'Thrown away-rusty',
 'incorrectly credited C550456 see 47',
 'Next Day Carriage',
 'Water damaged',
 'Printing smudges/thrown away',
 'Show Samples',
 'Damages/samples',
 'Dotcomgiftshop Gift Voucher £100.00',
 'Sold as 1 on dotcom',
 'Crushed',
 '??',
 'Amazon',
 'Found in w/hse',
 'Dagamed',
 'Lighthouse Trading zero invc incorr',
 'Incorrect stock entry.',
 'Wet pallet-thrown away',
 'Had been put aside.',
 'Sale error',
 'High Resolution Image',
 'Amazon Adjustment',
 'Breakages',
 'Marked as 23343',
 '20713',
 'Found by jackie',
 'Damages',
 'Unsaleable, destroyed.',
 'Wrongly mrked had 85123a in box',
 'John Lewis',
 '???']
In [40]:
retail[(retail['Description'].isin(mixed))]
Out[40]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
141 C536379 D Discount -1 2010-12-01 09:41:00 27.50 14527.0 United Kingdom
2239 536569 M Manual 1 2010-12-01 15:35:00 1.25 16274.0 United Kingdom
2250 536569 M Manual 1 2010-12-01 15:35:00 18.95 16274.0 United Kingdom
2567 536592 21594 Dr. Jam's Arouzer Stress Ball 1 2010-12-01 17:06:00 4.21 NaN United Kingdom
3305 536620 22965 3 TRADITIONAl BISCUIT CUTTERS SET 6 2010-12-02 10:27:00 2.10 14135.0 United Kingdom
... ... ... ... ... ... ... ... ...
537782 581336 23444 Next Day Carriage 1 2011-12-08 12:10:00 15.00 16161.0 United Kingdom
538321 581405 M Manual 3 2011-12-08 13:50:00 0.42 13521.0 United Kingdom
539735 581439 22965 3 TRADITIONAl BISCUIT CUTTERS SET 1 2011-12-08 16:30:00 4.13 NaN United Kingdom
541054 581492 22965 3 TRADITIONAl BISCUIT CUTTERS SET 1 2011-12-09 10:03:00 4.13 NaN United Kingdom
541541 C581499 M Manual -1 2011-12-09 10:28:00 224.69 15498.0 United Kingdom

1158 rows × 8 columns

In [41]:
to_drop = ['?', 'Found', 'Given away', 'Thrown away.', 'mystery! Only ever imported 1800', 'Display', 'Missing',
 'damages/credits from ASOS.', 'Not rcvd in 10/11/2010 delivery', 'Thrown away-rusty',
 'incorrectly credited C550456 see 47', 'Damaged', 'Water damaged', 'Printing smudges/thrown away',
 'Show Samples', 'Damages/samples', 'Adjust bad debt', 'Crushed', '??', 'Found in w/hse', 'Dagamed', 'Incorrect stock entry.',
 'Wet pallet-thrown away', 'Had been put aside.', 'Sale error', 'Breakages', 'Marked as 23343', '20713', 'Found by jackie',
 'Damages', 'Unsaleable, destroyed.', 'Wrongly mrked had 85123a in box', 'John Lewis', '???']
In [42]:
retail[retail['Description'].isin(to_drop)]
# no unit price, no customer ID, and a problematic description: these rows should be dropped
Out[42]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
7313 537032 21275 ? -30 2010-12-03 16:50:00 0.0 NaN United Kingdom
21518 538090 20956 ? -723 2010-12-09 14:48:00 0.0 NaN United Kingdom
38261 539494 21479 ? 752 2010-12-20 10:36:00 0.0 NaN United Kingdom
39047 539611 85135B Found 53 2010-12-20 14:33:00 0.0 NaN United Kingdom
43662 540100 22837 ? -106 2011-01-04 16:53:00 0.0 NaN United Kingdom
... ... ... ... ... ... ... ... ...
431383 573598 79342B Unsaleable, destroyed. -1128 2011-10-31 15:18:00 0.0 NaN United Kingdom
455407 575615 82582 ?? -130 2011-11-10 12:51:00 0.0 NaN United Kingdom
456830 575721 22804 Wrongly mrked had 85123a in box -256 2011-11-10 18:19:00 0.0 NaN United Kingdom
478681 577102 21915 John Lewis 200 2011-11-17 17:01:00 0.0 NaN United Kingdom
524370 580547 21201 ??? -390 2011-12-05 09:29:00 0.0 NaN United Kingdom

114 rows × 8 columns

In [43]:
#drop problematic descriptions
retail.drop(retail[retail['Description'].isin(to_drop)].index, axis=0, inplace=True)

Filling NaN descriptions

In [44]:
nan_stock_code = retail[retail['Description'].isna()]['StockCode'].unique()
In [45]:
# fill NaN descriptions with the most common description for the same StockCode, when at least one exists
for i in nan_stock_code:
    if len(retail[(retail['StockCode']==i)]['Description'].value_counts()) != 0:
        retail.loc[(retail['StockCode']==i) & (retail['Description'].isna()), 'Description'] = retail[retail['StockCode']==i]['Description'].value_counts().index[0]
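The loop above rescans the whole frame once per missing StockCode. The same fill can be done in one pass with a groupby; a sketch on a toy frame with hypothetical values, standing in for `retail`:

```python
import pandas as pd

# Toy data: StockCode 'A' has a known description, 'B' never does
df = pd.DataFrame({'StockCode':   ['A', 'A', 'A', 'B'],
                   'Description': ['RED MUG', None, 'RED MUG', None]})

# Most frequent non-null description per StockCode, broadcast back to every row
modes = df.groupby('StockCode')['Description'].transform(
    lambda s: s.value_counts().index[0] if s.notna().any() else None)
df['Description'] = df['Description'].fillna(modes)

print(df['Description'].tolist()[:3])  # ['RED MUG', 'RED MUG', 'RED MUG']
```

Rows whose StockCode never has a description (like 'B' here) stay NaN, matching the behaviour of the loop.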
In [46]:
retail.isna().sum()
Out[46]:
InvoiceNo           0
StockCode           0
Description       118
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     133643
Country             0
dtype: int64
In [47]:
#drop 118 rows with no description provided
retail.dropna(subset=['Description'], inplace=True)
retail.isna().sum()
Out[47]:
InvoiceNo           0
StockCode           0
Description         0
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     133525
Country             0
dtype: int64

2.1.4. Cancelled invoices

In [48]:
#create a function that returns 1 when the first character of 'InvoiceNo' is 'C' (a cancellation) and 0 otherwise
def cancel(row):
    value = 0
    if row['InvoiceNo'][0] == 'C':
        value = 1
    return value

# Create a new column 'Cancel' to attach to 'retail' and set it to the value returned 
  #by the function cancel().
    
# The code 'axis=1' makes the apply function process the dataset by row, 
  #as opposed to by column which is the default option.
retail['Cancel'] = retail.apply(cancel, axis=1)
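The row-wise apply works but is slow on half a million rows; the same flag can be computed with a single vectorized string operation. A sketch on a toy frame (the column values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'InvoiceNo': ['536365', 'C536379', '581498']})

# Vectorized equivalent of the row-wise cancel() function above
df['Cancel'] = df['InvoiceNo'].str.startswith('C').astype(int)

print(df['Cancel'].tolist())  # [0, 1, 0]
```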
In [49]:
# get cancelled transactions
cancelled_orders = retail[retail['Cancel']==1]
cancelled_orders
Out[49]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Cancel
141 C536379 D Discount -1 2010-12-01 09:41:00 27.50 14527.0 United Kingdom 1
154 C536383 35004C SET OF 3 COLOURED FLYING DUCKS -1 2010-12-01 09:49:00 4.65 15311.0 United Kingdom 1
235 C536391 22556 PLASTERS IN TIN CIRCUS PARADE -12 2010-12-01 10:24:00 1.65 17548.0 United Kingdom 1
236 C536391 21984 PACK OF 12 PINK PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0 United Kingdom 1
237 C536391 21983 PACK OF 12 BLUE PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0 United Kingdom 1
... ... ... ... ... ... ... ... ... ...
540449 C581490 23144 ZINC T-LIGHT HOLDER STARS SMALL -11 2011-12-09 09:57:00 0.83 14397.0 United Kingdom 1
541541 C581499 M Manual -1 2011-12-09 10:28:00 224.69 15498.0 United Kingdom 1
541715 C581568 21258 VICTORIAN SEWING BOX LARGE -5 2011-12-09 11:57:00 10.95 15311.0 United Kingdom 1
541716 C581569 84978 HANGING HEART JAR T-LIGHT HOLDER -1 2011-12-09 11:58:00 1.25 17315.0 United Kingdom 1
541717 C581569 20979 36 PENCILS TUBE RED RETROSPOT -5 2011-12-09 11:58:00 1.25 17315.0 United Kingdom 1

9085 rows × 9 columns

In [50]:
cancelled_orders[cancelled_orders['Quantity']>0]
Out[50]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Cancel

A negative value in the Quantity column indicates a cancelled quantity: we found no positive quantity in any order whose InvoiceNo carries the 'C' prefix. How many cancelled orders do we have?

In [51]:
#count the unique cancelled invoices
print("We have ",cancelled_orders['InvoiceNo'].nunique(), " cancelled orders.")
#percentage of cancelled orders among all orders
total_orders = retail['InvoiceNo'].nunique()
cancelled_number = cancelled_orders['InvoiceNo'].nunique()
print('Percentage of orders canceled: {}/{} ({:.2f}%) '.format(cancelled_number, total_orders, cancelled_number/total_orders*100))
We have  3731  cancelled orders.
Percentage of orders canceled: 3731/24985 (14.93%) 

2.1.5. Non-Cancelled invoices

In [52]:
# for modelling we keep only positive, non-cancelled orders from identified (existing) customers
retail_clean = retail[retail['Quantity'] > 0]
retail_clean = retail_clean.dropna(subset=['CustomerID'])
In [53]:
# Distribution of number of purchases per customer
data1 = retail_clean['CustomerID'].value_counts()  # number of purchase rows per customer
data2 = data1.value_counts(normalize=True)[:9]
data2[10] = data1.value_counts(normalize=True)[9:].sum()  # collapse counts of 10+ into one bucket

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(17,5))
ax1.boxplot(data1)
ax2.bar(data2.index, data2.values)

ax2.set_xticks(list(range(1,11)))
ax2.set_xticklabels(list(range(1,10)) +['10+'])
fig.suptitle("Distribution of number of purchases per customer")

plt.show()
print("{0:.2f}% of customers have more than 1 purchase!".format(100 * (np.sum(data1 > 1) / data1.shape[0])))
98.29% of customers have more than 1 purchase!
In [54]:
# Distribution of number of item events
data1 = retail_clean['StockCode'].value_counts()  # count of events per item
data2 = data1.value_counts(normalize=True)[:9]
data2[10] = data1.value_counts(normalize=True)[9:].sum()  # collapse counts of 10+ into one bucket

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(17,5))
ax1.boxplot(data1)
ax2.bar(data2.index, data2.values)

ax2.set_xticks(list(range(1,11)))
ax2.set_xticklabels(list(range(1,10)) +['10+'])
fig.suptitle("Distribution of number of item events")

plt.show()
print("{0:.2f}% of items have more than 1 event!".format(100 * (np.sum(data1 > 1) / data1.shape[0])))
95.44% of items have more than 1 event!
In [55]:
# Create an additional column for date as year and month
retail_clean['Date'] = retail_clean['InvoiceDate'].dt.strftime("%Y-%m")

# Create a new column for the total expenditure of that product in the purchase.
retail_clean['Sales'] = (retail_clean['UnitPrice'] * retail_clean['Quantity'])
In [56]:
#Visualize the variable productsNumber distribution
groupby_invoice = pd.DataFrame(retail_clean.groupby('InvoiceNo')['StockCode'].nunique())
groupby_invoice.columns=['productsNumber']
fig, ax = plt.subplots()
fig.set_size_inches(10, 6)
sns.distplot(groupby_invoice['productsNumber'],ax=ax)
plt.show()
# We have a skewed distribution: most invoices contain fewer than 25 distinct products.
In [57]:
fig = px.treemap(retail_clean,
                 path = ['Country'],
                 values='Sales')
fig.show()
In [58]:
#top-5 countries
retail_clean.groupby('Country').sum().sort_values(by='Sales', ascending=False)[:5]
Out[58]:
Quantity UnitPrice Cancel Sales
Country
United Kingdom 4253982 1.014265e+06 0 7.261178e+06
Netherlands 200834 5.686730e+03 0 2.838893e+05
EIRE 140383 3.213496e+04 0 2.652625e+05
Germany 118042 2.581858e+04 0 2.076774e+05
France 110602 2.260586e+04 0 1.851582e+05
In [59]:
retail_clean['CustomerID'].value_counts()
Out[59]:
17841.0    7676
14911.0    5672
14096.0    5095
12748.0    4412
14606.0    2676
           ... 
17956.0       1
15389.0       1
17948.0       1
15510.0       1
12346.0       1
Name: CustomerID, Length: 4339, dtype: int64
In [60]:
# Visualize number of events per day
df = pd.DatetimeIndex(retail_clean['InvoiceDate']).normalize().value_counts().sort_index()
fig = plt.figure(figsize=(12,6))
plt.plot(df.index, df.values, linestyle="-")
plt.xticks(np.arange(df.index[0], df.index[-1], pd.to_timedelta(7, unit='d')), rotation=90)
plt.title('Event frequency time series')
plt.show()
In [61]:
# How many weeks does the dataset span?
diff = (df.index.max() - df.index.min())
print(f"The dataset has {diff.days} days, corresponding to {diff.days//7} weeks.")
The dataset has 373 days, corresponding to 53 weeks.

3. ABC-XYZ Analysis

3.1 Revenue Analysis

In [62]:
def ABC_analysis(df):
    grouped_df = (
            df.loc[:, ['CustomerID','Sales']]
            .groupby('CustomerID')
            .sum()         
        )

    grouped_df = grouped_df.sort_values(by=['Sales'], ascending=False)
    
    ## Ranking by importance
    grouped_df["Rank"] = grouped_df['Sales'].rank(ascending=False)
    grouped_df["Importance"] = ' '
    grouped_df = grouped_df.reset_index()

    ## Categorise customers into classes A, B, C using a 20-30-50 rank split
    imp_col = grouped_df.columns.get_loc('Importance')
    n = int(grouped_df['Rank'].max())
    grouped_df.iloc[0: int(0.2 * n), imp_col] = 'A'
    grouped_df.iloc[int(0.2 * n): int(0.5 * n), imp_col] = 'B'
    grouped_df.iloc[int(0.5 * n):, imp_col] = 'C'
    
    return grouped_df
In [63]:
ABC_groups = ABC_analysis(retail_clean)
In [64]:
ABC_groups.head()
Out[64]:
CustomerID Sales Rank Importance
0 14646.0 279138.02 1.0 A
1 18102.0 259657.30 2.0 A
2 17450.0 194390.79 3.0 A
3 16446.0 168472.50 4.0 A
4 14911.0 143711.17 5.0 A
In [65]:
sns.barplot(x = 'Importance', y = 'Sales',data = ABC_groups)
ABC_groups['Importance'].value_counts()
Out[65]:
C    2170
B    1302
A     867
Name: Importance, dtype: int64
In [66]:
print("Revenue contribution of each group (%):")
ABC_groups.groupby('Importance')['Sales'].sum() / ABC_groups['Sales'].sum() * 100.
Revenue contribution of each group (%):
Out[66]:
Importance
A    74.612977
B    17.560549
C     7.826474
Name: Sales, dtype: float64
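The split above assigns classes by customer rank (top 20% of customers are A, and so on). A common alternative, sketched below on toy data, assigns classes by cumulative share of revenue instead; the 70%/90% cut-offs are illustrative assumptions, not the 20-30-50 rank split used above.

```python
import pandas as pd

def abc_by_revenue_share(sales, a_cut=0.7, b_cut=0.9):
    """Label customers A/B/C by cumulative share of total revenue.

    `sales` is a Series of total revenue indexed by customer.
    Customers covering the first `a_cut` of revenue are A, up to
    `b_cut` are B, and the long tail is C.
    """
    s = sales.sort_values(ascending=False)
    cum_share = s.cumsum() / s.sum()
    labels = pd.Series('C', index=s.index)
    labels[cum_share <= b_cut] = 'B'
    labels[cum_share <= a_cut] = 'A'
    return labels

sales = pd.Series({'c1': 700.0, 'c2': 150.0, 'c3': 100.0, 'c4': 50.0})
print(abc_by_revenue_share(sales).to_dict())
# → {'c1': 'A', 'c2': 'B', 'c3': 'C', 'c4': 'C'}
```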

3.2 Frequency Analysis

In [67]:
retail_clean['Y-M'] = pd.to_datetime(retail_clean['InvoiceDate']).dt.to_period('M')
In [68]:
retail_clean.groupby(['CustomerID', 'Y-M'])['Y-M'].count()
Out[68]:
CustomerID  Y-M    
12346.0     2011-01      1
12347.0     2010-12     31
            2011-01     29
            2011-04     24
            2011-06     18
                      ... 
18283.0     2011-10     38
            2011-11    209
            2011-12     50
18287.0     2011-05     29
            2011-10     41
Name: Y-M, Length: 13046, dtype: int64
In [69]:
pt = retail_clean.pivot_table(values='InvoiceNo', index='CustomerID', columns='Y-M', aggfunc=lambda x: 1 if len(x)>0 else 0).fillna(0)
pt.head()
Out[69]:
Y-M 2010-12 2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 2011-08 2011-09 2011-10 2011-11 2011-12
CustomerID
12346.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12347.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0
12348.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
12349.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
12350.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [70]:
pt['sum'] = pt.sum(axis=1)
In [71]:
customer_freq = pd.DataFrame()
customer_freq['customer'] = pt.index
customer_freq['freq_val'] = pt['sum'].values
In [72]:
customer_freq['group'] = customer_freq['freq_val'].map({1: 'Z', 2: 'Z' , 3: 'Z', 4:'Z', 5:'Y', 6:'Y', 7:'Y', 8:'Y' , 9:'X', 10: 'X', 11:'X', 12:'X'}) 
In [73]:
sns.barplot(x = customer_freq.groupby(['group']).agg('count').index, y = customer_freq.groupby(['group']).agg('count').values[:,1])
customer_freq['group'].value_counts().sort_values(ascending=True)
Out[73]:
X     213
Y     619
Z    3463
Name: group, dtype: int64
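The `map` dictionary in In [72] enumerates every value from 1 to 12 by hand; `pd.cut` expresses the same X/Y/Z binning (1-4 active months → Z, 5-8 → Y, 9-12 → X) more compactly. A minimal sketch on toy frequencies:

```python
import pandas as pd

# Bin monthly purchase frequency (1-12 active months) into X/Y/Z groups,
# matching the notebook's mapping: 1-4 -> Z, 5-8 -> Y, 9-12 -> X.
freq = pd.Series([1, 4, 5, 8, 9, 12])
group = pd.cut(freq, bins=[0, 4, 8, 12], labels=['Z', 'Y', 'X'])
print(group.tolist())  # → ['Z', 'Z', 'Y', 'Y', 'X', 'X']
```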

4. Modeling

Customer-Item Matrix

In [74]:
customer_item_matrix = retail_clean.pivot_table(
    index='CustomerID', 
    columns='StockCode', 
    values='Quantity',
    aggfunc='sum'
)
In [75]:
customer_item_matrix = customer_item_matrix.applymap(lambda x: 1 if x > 0 else 0)
In [76]:
customer_item_matrix.shape
Out[76]:
(4339, 3663)
In [77]:
customer_item_matrix
Out[77]:
StockCode 10002 10080 10120 10123C 10124A 10124G 10125 10133 10135 11001 ... 90214T 90214U 90214V 90214W 90214Y 90214Z BANK CHARGES C2 M PADS
CustomerID
12346.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12347.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12348.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12349.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12350.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18280.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18281.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18282.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18283.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
18287.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

4339 rows × 3663 columns

4.1 Collaborative Filtering

4.1.1 User-based Collaborative Filtering

In [78]:
user_user_sim_matrix = pd.DataFrame(cosine_similarity(customer_item_matrix))
In [79]:
user_user_sim_matrix.columns = customer_item_matrix.index
user_user_sim_matrix['CustomerID'] = customer_item_matrix.index
user_user_sim_matrix = user_user_sim_matrix.set_index('CustomerID')
In [80]:
user_user_sim_matrix.loc[17935.0].sort_values(ascending=False)
Out[80]:
CustomerID
17935.0    1.000000
14813.0    0.269430
16305.0    0.205152
18174.0    0.192450
17029.0    0.192450
             ...   
14820.0    0.000000
14821.0    0.000000
14823.0    0.000000
14829.0    0.000000
15299.0    0.000000
Name: 17935.0, Length: 4339, dtype: float64
In [81]:
items_bought_by_A = set(customer_item_matrix.loc[12350.0].iloc[customer_item_matrix.loc[12350.0].to_numpy().nonzero()].index)
items_bought_by_A
Out[81]:
{'20615',
 '20652',
 '21171',
 '21832',
 '21864',
 '21866',
 '21908',
 '21915',
 '22348',
 '22412',
 '22551',
 '22557',
 '22620',
 '79066K',
 '79191C',
 '84086C'}
In [82]:
items_bought_by_B = set(customer_item_matrix.loc[17935.0].iloc[
    customer_item_matrix.loc[17935.0].to_numpy().nonzero()
].index)
items_bought_by_B
Out[82]:
{'20657',
 '20659',
 '20828',
 '20856',
 '21051',
 '21866',
 '21867',
 '22208',
 '22209',
 '22210',
 '22211',
 '22449',
 '22450',
 '22551',
 '22553',
 '22557',
 '22640',
 '22659',
 '22749',
 '22752',
 '22753',
 '22754',
 '22755',
 '23290',
 '23292',
 '23309',
 '85099B'}
In [83]:
items_to_recommend_to_B = items_bought_by_A - items_bought_by_B
items_to_recommend_to_B
Out[83]:
{'20615',
 '20652',
 '21171',
 '21832',
 '21864',
 '21908',
 '21915',
 '22348',
 '22412',
 '22620',
 '79066K',
 '79191C',
 '84086C'}
In [84]:
retail_clean.loc[
    retail_clean['StockCode'].isin(items_to_recommend_to_B), 
    ['StockCode', 'Description']
].drop_duplicates().set_index('StockCode')
Out[84]:
Description
StockCode
21832 CHOCOLATE CALCULATOR
21915 RED HARMONICA IN BOX
22620 4 TRADITIONAL SPINNING TOPS
79066K RETRO MOD TRAY
21864 UNION JACK FLAG PASSPORT COVER
79191C RETRO PLASTIC ELEPHANT TRAY
21908 CHOCOLATE THIS WAY METAL SIGN
20615 BLUE POLKADOT PASSPORT COVER
20652 BLUE POLKADOT LUGGAGE TAG
22348 TEA BAG PLATE RED RETROSPOT
22412 METAL SIGN NEIGHBOURHOOD WITCH
21171 BATHROOM METAL SIGN
84086C PINK/PURPLE RETRO RADIO
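The user-based steps above (cosine similarity on the binary matrix, then a set difference of purchased items) can be wrapped into one reusable helper. This is a self-contained sketch on a toy matrix, using a plain NumPy cosine rather than sklearn's `cosine_similarity`:

```python
import numpy as np
import pandas as pd

def recommend_from_most_similar(ci, target):
    """User-based CF sketch: find the user most similar (by cosine) to
    `target` in a binary customer-item matrix `ci`, and recommend the
    items that user bought which `target` has not bought yet."""
    idx = ci.index.get_loc(target)
    m = ci.to_numpy(dtype=float)
    norms = np.linalg.norm(m, axis=1)
    sims = (m @ m[idx]) / (norms * norms[idx] + 1e-12)
    sims[idx] = -1.0  # exclude the target user itself
    best = ci.index[int(np.argmax(sims))]
    bought_best = set(ci.columns[ci.loc[best] > 0])
    bought_target = set(ci.columns[ci.loc[target] > 0])
    return sorted(bought_best - bought_target)

# toy matrix: users u1..u3, items i1..i4
ci = pd.DataFrame([[1, 1, 0, 0],
                   [1, 1, 1, 0],
                   [0, 0, 0, 1]],
                  index=['u1', 'u2', 'u3'],
                  columns=['i1', 'i2', 'i3', 'i4'])
print(recommend_from_most_similar(ci, 'u1'))  # → ['i3']
```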

4.1.2 Item-based Collaborative Filtering

In [85]:
item_item_sim_matrix = pd.DataFrame(cosine_similarity(customer_item_matrix.T))
item_item_sim_matrix.columns = customer_item_matrix.T.index
item_item_sim_matrix['StockCode'] = customer_item_matrix.T.index
item_item_sim_matrix = item_item_sim_matrix.set_index('StockCode')
In [86]:
item_item_sim_matrix
Out[86]:
StockCode 10002 10080 10120 10123C 10124A 10124G 10125 10133 10135 11001 ... 90214T 90214U 90214V 90214W 90214Y 90214Z BANK CHARGES C2 M PADS
StockCode
10002 1.000000 0.000000 0.094868 0.091287 0.0 0.000000 0.090351 0.062932 0.098907 0.095346 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.029361 0.067082 0.0
10080 0.000000 1.000000 0.000000 0.000000 0.0 0.000000 0.032774 0.045655 0.047836 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.016222 0.0
10120 0.094868 0.000000 1.000000 0.115470 0.0 0.000000 0.057143 0.059702 0.041703 0.060302 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.070711 0.0
10123C 0.091287 0.000000 0.115470 1.000000 0.0 0.000000 0.164957 0.000000 0.000000 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0
10124A 0.000000 0.000000 0.000000 0.000000 1.0 0.447214 0.063888 0.044499 0.000000 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
90214Z 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.0 1.0 0.707107 1.0 0.577350 1.0 0.000000 0.000000 0.000000 0.0
BANK CHARGES 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.032969 0.000000 ... 0.0 0.0 0.223607 0.0 0.182574 0.0 1.000000 0.000000 0.089443 0.0
C2 0.029361 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.036955 0.019360 0.055989 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 1.000000 0.026261 0.0
M 0.067082 0.016222 0.070711 0.000000 0.0 0.000000 0.070711 0.070360 0.066349 0.106600 ... 0.0 0.0 0.050000 0.0 0.040825 0.0 0.089443 0.026261 1.000000 0.0
PADS 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.049752 0.000000 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.000000 1.0

3663 rows × 3663 columns

In [87]:
top_10_similar_items = list(
    item_item_sim_matrix\
        .loc['23166']\
        .sort_values(ascending=False)\
        .iloc[:10]\
    .index
)
In [88]:
top_10_similar_items
Out[88]:
['23166',
 '23165',
 '23167',
 '22993',
 '23307',
 '22722',
 '22720',
 '22666',
 '23243',
 '22961']
In [89]:
retail_clean.loc[
    retail_clean['StockCode'].isin(top_10_similar_items), 
    ['StockCode', 'Description']
].drop_duplicates().set_index('StockCode').loc[top_10_similar_items]
Out[89]:
Description
StockCode
23166 MEDIUM CERAMIC TOP STORAGE JAR
23165 LARGE CERAMIC TOP STORAGE JAR
23167 SMALL CERAMIC TOP STORAGE JAR
22993 SET OF 4 PANTRY JELLY MOULDS
23307 SET OF 60 PANTRY DESIGN CAKE CASES
22722 SET OF 6 SPICE TINS PANTRY DESIGN
22720 SET OF 3 CAKE TINS PANTRY DESIGN
22666 RECIPE BOX PANTRY YELLOW DESIGN
23243 SET OF TEA COFFEE SUGAR TINS PANTRY
22961 JAM MAKING SET PRINTED

4.1.3 ALS model

In [90]:
users_counts= retail_clean['CustomerID'].value_counts()
items_counts = retail_clean['StockCode'].value_counts()

scores = []
for n_events in range(1, 10):
    users_counts = users_counts[users_counts > n_events]
    items_counts = items_counts[items_counts > n_events]

    scores.append(retail_clean.shape[0] / (len(users_counts) * len(items_counts)) * 100)
px.line(x = range(1, 10), y = scores, labels = {'x':'n_events', 'y':'Sparsity'})
In [91]:
def threshold_ratings(df, uid_min, iid_min, max_iter=None):
    """Removes users and items with less than uid_min and iid_min event occurrences, respectively.
    Credits: https://www.ethanrosenthal.com/2016/10/19/implicit-mf-part-1/
    """
    n_users = df['CustomerID'].nunique()
    n_items = df['StockCode'].nunique()
    sparsity = float(df.shape[0]) / float(n_users * n_items) * 100
    print('Raw dataset info \n-----------------')
    print('Number of users: {}'.format(n_users))
    print('Number of items: {}'.format(n_items))
    print('Sparsity: {:4.3f}%'.format(sparsity))
    
    done, i = False, 0
    while not done:
        # When we exclude users with freq less than uid_min we might end up with new 
        # items with freq less than iid_min, so we will have to alternate back and forth
        starting_shape = df.shape[0]  # number of existing events

        uid_counts = df.groupby('CustomerID').size()  # user id frequencies
        df = df[~df['CustomerID'].isin(uid_counts[uid_counts < uid_min].index.tolist())]  # keep events with users with frequency >= uid_min

        iid_counts = df.groupby('StockCode').size()  # item id frequencies
        df = df[~df['StockCode'].isin(iid_counts[iid_counts < iid_min].index.tolist())]  # keep events with items with frequency >= iid_min

        ending_shape = df.shape[0]  # number of existing events after filters
        i += 1
        if starting_shape == ending_shape or i == max_iter:  # convergence happens
            done = True
    
    if not max_iter:
        assert(df.groupby('CustomerID').size().min() >= uid_min)
        assert(df.groupby('StockCode').size().min() >= iid_min)
    
    n_users = df['CustomerID'].nunique()
    n_items = df['StockCode'].nunique()
    sparsity = float(df.shape[0]) / float(n_users * n_items) * 100
    print('Limited dataset info \n-----------------')
    print('Number of iterations until convergence: {}'.format(i))
    print('Number of users: {}'.format(n_users))
    print('Number of items: {}'.format(n_items))
    print('Sparsity: {:4.3f}%'.format(sparsity))
    return df
In [92]:
# get limited dataset
df_limited = threshold_ratings(retail_clean, 10, 10)
Raw dataset info 
-----------------
Number of users: 4339
Number of items: 3663
Sparsity: 2.464%
Limited dataset info 
-----------------
Number of iterations until convergence: 3
Number of users: 3746
Number of items: 2863
Sparsity: 3.593%

Train-test split

We want to split the train and test events such that:

  • all test events occur after all train events
In [93]:
# How many weeks does the dataset span?
diff = (df_limited.InvoiceDate.max() - df_limited.InvoiceDate.min())
print(f"The dataset has {diff.days} days, corresponding to {diff.days//7} weeks.")
The dataset has 373 days, corresponding to 53 weeks.
In [94]:
# Train-test split
start_train = df_limited['InvoiceDate'].min()
start_test = start_train + pd.to_timedelta(45, unit='w')
end_test = start_test + pd.to_timedelta(5, unit='w')

# Create new limited df
df_limited = df_limited.loc[(df_limited['InvoiceDate'] > start_train) & (df_limited['InvoiceDate'] <= end_test)]

# Create train_split flag
df_limited['train_split'] = (df_limited['InvoiceDate'] <= start_test).astype(int)
print("Proportion of train events: {:.2f}".format(df_limited['train_split'].mean()))
Proportion of train events: 0.82
In [95]:
# Visualize train and test set
df = pd.DatetimeIndex(df_limited['InvoiceDate']).normalize().value_counts().sort_index()
fig = plt.figure(figsize=(12,6))
plt.plot(df.index, df.values, linestyle="-")
plt.xticks(np.arange(df.index[0], df.index[-1], pd.to_timedelta(7, unit='d')), rotation=90)
plt.vlines(start_test, 0, df.max(), linestyles='dashed', color='r', label='train-test split')
plt.legend()
plt.title('Event frequency time series - train and test set')
plt.show()
In [96]:
# the Categoricals data structure consists of a categories array and an integer array of codes which point to 
#    the real value in the categories array
user_cat = df_limited['CustomerID'].astype('category')
item_cat = df_limited['StockCode'].astype("category")

# create a sparse matrix of all the item/user/counts triples for the train set and test set
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix
item_user_train = coo_matrix((df_limited['train_split'],
                              (item_cat.cat.codes,
                               user_cat.cat.codes))).tocsr()
item_user_train.eliminate_zeros()  # remove zero entries
# produce transpose of item_user_train
user_item_train = item_user_train.T

item_user_test = coo_matrix(((~df_limited['train_split'].astype(bool)).astype(int),
                             (item_cat.cat.codes,
                              user_cat.cat.codes))).tocsr()
item_user_test.eliminate_zeros()  # remove zero entries
# produce transpose of item_user_test
user_item_test = item_user_test.T

# map each item and user category to a unique numeric code
user_map = dict(zip(user_cat, user_cat.cat.codes))
item_map = dict(zip(item_cat, item_cat.cat.codes))

def get_keys(value, dictionary):
    """Return the dictionary key associated with the specified value."""
    return list(dictionary.keys())[list(dictionary.values()).index(value)]

# confirm shapes
print(f"train set shape: {item_user_train.shape} and test set shape: {item_user_test.shape}")

# check sparsity
pzeros_train = 100 * (1 - item_user_train.count_nonzero() / (item_user_train.shape[0] * item_user_train.shape[1]))
pzeros_test = 100 * (1 - item_user_test.count_nonzero() / (item_user_test.shape[0] * item_user_test.shape[1]))
print(f"train set percentage of zeros: {pzeros_train} and test set percentage of zeros: {pzeros_test}")
train set shape: (2851, 3605) and test set shape: (2851, 3605)
train set percentage of zeros: 98.09517647407947 and test set percentage of zeros: 99.46885804479632
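`get_keys` scans the whole dictionary on every call. Since pandas `Categorical` codes are unique per category, a one-off inverted dictionary gives O(1) reverse lookups instead; a minimal sketch with a toy stand-in for the real `item_map`:

```python
# Toy stand-in for the real item_map built from the Categorical codes.
item_map = {'10002': 0, '10080': 1, '10120': 2}

# Invert once; each code maps back to exactly one StockCode.
inv_item_map = {code: key for key, code in item_map.items()}
print(inv_item_map[1])  # → 10080
```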
In [97]:
# users with no items in the train set and no items in the test set
zero_users_test = (np.squeeze(np.asarray(user_item_test.sum(axis=1))) == 0).nonzero()[0]
zero_users_train = (np.squeeze(np.asarray(user_item_train.sum(axis=1))) == 0).nonzero()[0]
set(zero_users_test).intersection(zero_users_train)
Out[97]:
set()
In [98]:
# most frequent user, item pair in train set
item_id, user_id = np.unravel_index(item_user_train.argmax(), item_user_train.shape)
item_id, user_id = get_keys(item_id, item_map), get_keys(user_id, user_map)
df_limited.loc[(df_limited['CustomerID'] == user_id) & (df_limited['StockCode'] == item_id) & (df_limited['train_split'] == 1)]
Out[98]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Cancel Date Sales Y-M train_split
1423 536540 C2 CARRIAGE 1 2010-12-01 14:05:00 50.0 14911.0 EIRE 0 2010-12 50.0 2010-12 1
12119 537368 C2 CARRIAGE 1 2010-12-06 12:40:00 50.0 14911.0 EIRE 0 2010-12 50.0 2010-12 1
12452 537378 C2 CARRIAGE 1 2010-12-06 13:06:00 50.0 14911.0 EIRE 0 2010-12 50.0 2010-12 1
37644 539473 C2 CARRIAGE 1 2010-12-19 14:24:00 50.0 14911.0 EIRE 0 2010-12 50.0 2010-12 1
42332 539984 C2 CARRIAGE 1 2010-12-23 14:58:00 50.0 14911.0 EIRE 0 2010-12 50.0 2010-12 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
369433 569028 C2 CARRIAGE 1 2011-09-30 10:20:00 50.0 14911.0 EIRE 0 2011-09 50.0 2011-09 1
370190 569130 C2 CARRIAGE 1 2011-09-30 13:46:00 50.0 14911.0 EIRE 0 2011-09 50.0 2011-09 1
380247 569739 C2 CARRIAGE 1 2011-10-06 10:48:00 50.0 14911.0 EIRE 0 2011-10 50.0 2011-10 1
391089 570651 C2 CARRIAGE 1 2011-10-11 13:34:00 50.0 14911.0 EIRE 0 2011-10 50.0 2011-10 1
392506 570694 C2 CARRIAGE 1 2011-10-12 08:10:00 50.0 14911.0 EIRE 0 2011-10 50.0 2011-10 1

61 rows × 13 columns

In [99]:
# initialize a model
alpha = 40  # as we observe more evidence for positive preference, our confidence in pui = 1 increases according to alpha (rate of increase)
als_model = AlternatingLeastSquares(factors=200, regularization=0.01, iterations=30, random_state=0)

# train the model on a sparse matrix of item/user/confidence weights
# os.environ['MKL_NUM_THREADS'] = '1'
# os.environ['OPENBLAS_NUM_THREADS'] = '1'
# about the alpha hyperparameter: https://github.com/benfred/implicit/issues/199#issuecomment-490350326
als_model.fit((item_user_train * alpha).astype('double'))
WARNING:root:Intel MKL BLAS detected. Its highly recommend to set the environment variable 'export MKL_NUM_THREADS=1' to disable its internal multithreading
In [100]:
# recommend items for a user. 
# the recommended items have the largest inner product with the user vector
user_id = list(user_map.keys())[0]
recommendations = als_model.recommend(user_map[user_id], user_item_train)
list(map(lambda x: (get_keys(x[0], item_map), x[1]), recommendations))
Out[100]:
[('82482', 1.0086129),
 ('82494L', 1.007163),
 ('71053', 1.0046384),
 ('82483', 1.0035806),
 ('85123A', 1.0032511),
 ('21871', 1.0031767),
 ('82486', 1.0015827),
 ('22633', 0.99967825),
 ('21068', 0.9992939),
 ('20679', 0.99896705)]
In [101]:
# find related items
# the related items have the largest inner product with the item vector
item_id = list(item_map.keys())[0]
related = als_model.similar_items(item_map[item_id])
list(map(lambda x: (get_keys(x[0], item_map), x[1]), related))
Out[101]:
[('22633', 1.0000001),
 ('22866', 0.70801145),
 ('22632', 0.6668256),
 ('22867', 0.64050275),
 ('22865', 0.6276864),
 ('23439', 0.5576435),
 ('47471', 0.44116738),
 ('20622', 0.43994918),
 ('71270', 0.43286),
 ('22834', 0.42712727)]
In [102]:
# show the top 10 items that explain the recommended item to the user
# It is possible to write the LVM as a linear function between preferences and past actions.
# We can then see what are the actions associated with the highest contributions to the given recommendation.
score, contributions, user_weights = als_model.explain(user_map[user_id], 
                                                       user_item_train,
                                                       item_map[item_id])
print("The score of the user/item pair is: ", score)
print("The top N (itemid, score) contributions for this user/item pair are:\n", list(map(lambda x: (get_keys(x[0], item_map), x[1]), contributions)))
The score of the user/item pair is:  0.8337297945735302
The top N (itemid, score) contributions for this user/item pair are:
 [('22633', 0.5998675876413986), ('22632', 0.18084661598572227), ('84029G', 0.03093292915353203), ('71053', 0.017922941159664606), ('22411', 0.017447190241423043), ('82483', 0.015000456531072803), ('84029E', 0.011043571711645404), ('82486', 0.010083624973891635), ('22803', 0.008637886774513876), ('85123A', 0.005609515555132108)]

4.1.4 PopularRecommender

In [103]:
# Baseline: Recommend the most popular items to every user
class PopularRecommender():
    """Baseline Recommender that always suggests the most popular items to every user.
    """
    def fit(self, item_users):
        self.item_id_sort = np.argsort(np.squeeze(np.asarray(item_users.sum(axis=1).reshape(-1))))[::-1]
    
    def recommend(self, userid, user_items, N=10, filter_already_liked_items=None, filter_items=None, recalculate_user=None):
        if filter_already_liked_items is not None or filter_items is not None or recalculate_user is not None:
            raise NotImplementedError("filter_already_liked_items, filter_items and recalculate_user aren't supported yet")
        
        return list(zip(self.item_id_sort[:N], range(1, N + 1)))
In [104]:
# Fitting PopularRecommender model
pop_model = PopularRecommender()
pop_model.fit(item_user_train)

4.1.5 Bayesian Personalized Ranking

In [105]:
bpr_model = implicit.bpr.BayesianPersonalizedRanking(factors=200, use_gpu=False, iterations = 120)
bpr_model.fit(item_user_train)

4.1.6 Logistic Matrix Factorization

In [106]:
lmf_model = implicit.lmf.LogisticMatrixFactorization(factors=200, use_gpu=False, iterations = 50)
lmf_model.fit(item_user_train)
100%|██████████| 50/50 [00:34<00:00,  1.43it/s]

5. Evaluation

In [107]:
# Evaluate models. 
# Precision at K, Mean Average Precision at K, Normalized Discounted Cumulative Gain at K, AUC at K
eval_models = {'pop_model': pop_model, 'als_model': als_model, 'lmf_model': lmf_model, 'bpr_model': bpr_model}
eval_table = {}
for k, v in eval_models.items():
    eval_table[k] = ranking_metrics_at_k(v, user_item_train, user_item_test, K=10, show_progress=True, num_threads=0)
eval_table = pd.DataFrame(eval_table)
eval_table
Out[107]:
pop_model als_model lmf_model bpr_model
precision 0.086880 0.045673 0.035862 0.052845
map 0.038730 0.018257 0.013179 0.022652
ndcg 0.089836 0.047069 0.035426 0.055402
auc 0.513039 0.507233 0.504178 0.508094
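`ranking_metrics_at_k` computes these metrics internally over all test users. For intuition, precision@K for a single user is just the fraction of the top-K recommendations that appear in that user's held-out items; a minimal sketch:

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that appear in the user's
    held-out (test) items. The 'precision' row above is this quantity
    averaged over users."""
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / k

recs = ['a', 'b', 'c', 'd', 'e']
test_items = {'b', 'e', 'z'}
print(precision_at_k(recs, test_items, k=5))  # → 0.4
```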

6. Cold start problem

In [108]:
# first variant - select the 10 most bought products from the last 100 purchases for new customers
data = retail_clean.copy()
data = data.sort_values('InvoiceDate')
recommendations = data.tail(100).sort_values('Quantity').tail(10)['StockCode'].values.tolist()
recommendations
Out[108]:
['22704',
 '23350',
 '84692',
 '23343',
 '23199',
 '85038',
 '23581',
 '20725',
 '85038',
 '20832']
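Note that selecting the 10 highest-Quantity rows can repeat a StockCode (as with '85038' above), since the same product may appear in several recent invoices. A variant that aggregates quantities per StockCode first, so each product appears at most once, sketched on toy data:

```python
import pandas as pd

def top_products(df, last_n=100, k=10):
    """Cold-start sketch: sum quantities per StockCode over the last
    `last_n` rows, so each product appears at most once in the top-k."""
    recent = df.sort_values('InvoiceDate').tail(last_n)
    return (recent.groupby('StockCode')['Quantity'].sum()
                  .sort_values(ascending=False).head(k).index.tolist())

toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-02',
                                   '2011-12-03', '2011-12-04']),
    'StockCode': ['85038', '85038', '20725', '23343'],
    'Quantity': [5, 6, 8, 2],
})
print(top_products(toy, last_n=4, k=2))  # → ['85038', '20725']
```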
In [109]:
# treat customers with a single purchase row, or with no CustomerID, as new customers
data1 = retail.groupby('CustomerID').filter(lambda x: len(x) == 1)
data2 = retail[retail['CustomerID'].isna()]
new_cust = data1.append(data2)

6.1 Collaborative filtering based on customer's purchase history

In [110]:
ratings_utility_matrix = retail.pivot_table(values='Quantity', index='CustomerID', columns='StockCode', fill_value=0)
ratings_utility_matrix.head()
Out[110]:
StockCode 10002 10080 10120 10123C 10124A 10124G 10125 10133 10135 11001 ... 90214V 90214W 90214Y 90214Z BANK CHARGES C2 CRUK D M PADS
CustomerID
12346.0 0.0 0 0.0 0 0 0 0 0.0 0.0 0.0 ... 0 0 0 0 0 0.0 0 0.0 0.0 0
12347.0 0.0 0 0.0 0 0 0 0 0.0 0.0 0.0 ... 0 0 0 0 0 0.0 0 0.0 0.0 0
12348.0 0.0 0 0.0 0 0 0 0 0.0 0.0 0.0 ... 0 0 0 0 0 0.0 0 0.0 0.0 0
12349.0 0.0 0 0.0 0 0 0 0 0.0 0.0 0.0 ... 0 0 0 0 0 0.0 0 0.0 0.0 0
12350.0 0.0 0 0.0 0 0 0 0 0.0 0.0 0.0 ... 0 0 0 0 0 0.0 0 0.0 0.0 0

5 rows × 3682 columns

In [111]:
ratings_utility_matrix.shape
Out[111]:
(4371, 3682)
In [112]:
X = ratings_utility_matrix.T
X.head()
Out[112]:
CustomerID 12346.0 12347.0 12348.0 12349.0 12350.0 12352.0 12353.0 12354.0 12355.0 12356.0 ... 18273.0 18274.0 18276.0 18277.0 18278.0 18280.0 18281.0 18282.0 18283.0 18287.0
StockCode
10002 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10080 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10120 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10123C 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10124A 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 4371 columns

In [113]:
SVD = TruncatedSVD(n_components=10)
decomposed_matrix = SVD.fit_transform(X)
decomposed_matrix.shape
Out[113]:
(3682, 10)
In [114]:
correlation_matrix = np.corrcoef(decomposed_matrix)
correlation_matrix.shape
Out[114]:
(3682, 3682)
In [115]:
product_names = list(X.index)
In [116]:
i = X.index[99]
In [117]:
product_ID = product_names.index(i)
product_ID
Out[117]:
99
In [118]:
correlation_product_ID = correlation_matrix[product_ID]
correlation_product_ID.shape
Out[118]:
(3682,)
In [119]:
Recommend = list(X.index[correlation_product_ID > 0.90])

# Remove the query item itself from its own recommendation list
Recommend.remove(i) 

Recommend[0:9]
Out[119]:
['10135',
 '15058C',
 '15060B',
 '16008',
 '16015',
 '16045',
 '16048',
 '16054',
 '16161G']

6.2 Item to item based recommendation system based on product description

In [120]:
itemset = retail[['StockCode','Description']]
In [121]:
itemset['Description']=itemset['Description'].astype(str)
In [122]:
# keep only fully upper-case descriptions (the standard product names);
# everything else is treated as missing
itemset['Description'] = [(np.where((x.isupper()),x, np.NaN)) for x in itemset['Description']]
In [123]:
itemset[itemset['Description'] =='nan']
Out[123]:
StockCode Description
141 D nan
482 21705 nan
918 46000M nan
1961 21703 nan
1962 21704 nan
... ... ...
540654 21704 nan
541054 22965 nan
541541 M nan
541612 21705 nan
541615 21705 nan

2473 rows × 2 columns

In [124]:
# fill missing descriptions with the most frequent description for that StockCode

df2 = itemset.groupby('StockCode')['Description'].apply(lambda x: x.fillna(x.mode().iloc[0])).reset_index(drop=True)

df2 = df2.to_frame()
In [125]:
df2[df2['Description']=='nan']
Out[125]:
Description
140 nan
480 nan
893 nan
1921 nan
1922 nan
... ...
532665 nan
533064 nan
533548 nan
533618 nan
533621 nan

2473 rows × 1 columns

In [126]:
product_descriptions = itemset.dropna()
In [127]:
product_descriptions.drop_duplicates(subset='StockCode',inplace=True)
product_descriptions
Out[127]:
StockCode Description
0 85123A WHITE HANGING HEART T-LIGHT HOLDER
1 71053 WHITE METAL LANTERN
2 84406B CREAM CUPID HEARTS COAT HANGER
3 84029G KNITTED UNION FLAG HOT WATER BOTTLE
4 84029E RED WOOLLY HOTTIE WHITE HEART.
... ... ...
509369 85179a GREEN BITTY LIGHT CHAIN
512588 23617 SET 10 CARDS SWIRLY XMAS TREE 17104
527065 90214U LETTER "U" BLING KEY RING
537224 47591b SCOTTIES CHILDRENS APRON
540421 23843 PAPER CRAFT , LITTLE BIRDIE

3935 rows × 2 columns

In [128]:
desc=product_descriptions['Description'].astype(str)
In [129]:
vectorizer = TfidfVectorizer(stop_words='english')
X1 = vectorizer.fit_transform(desc)
X1
Out[129]:
<3935x2020 sparse matrix of type '<class 'numpy.float64'>'
	with 16121 stored elements in Compressed Sparse Row format>
In [130]:
# Fitting K-Means to the dataset

X=X1

kmeans = KMeans(n_clusters = 10, init = 'k-means++')
y_kmeans = kmeans.fit_predict(X)
plt.plot(y_kmeans, ".")
plt.show()
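The number of clusters is fixed at 10 here; one way to sanity-check that choice is an elbow scan over the K-Means inertia (within-cluster sum of squares). A minimal sketch on toy 2-D data with an obvious two-cluster structure, where the elbow appears at k=2:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs, 20 points each.
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 0.1, (20, 2)),
                   rng.normal(5, 0.1, (20, 2))])

# Inertia for a range of k; the "elbow" is where the curve flattens.
inertias = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias.append(km.inertia_)

# inertia drops sharply from k=1 to k=2, then flattens -> elbow at k=2
print([round(v, 2) for v in inertias])
```

On the TF-IDF matrix `X1` the same loop (over a wider k range) would indicate whether 10 clusters is a reasonable choice.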
In [131]:
def print_cluster(i):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])

Recommend products related to the one currently selected by the user, based on the description cluster it falls into.

In [132]:
# number of clusters (fixed at 10, matching the K-Means fit above)
true_k = 10

model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X1)

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # use get_feature_names_out() in scikit-learn >= 1.0
for i in range(true_k):
    print_cluster(i)
Top terms per cluster:
Cluster 0:
 glass
 necklace
 black
 bracelet
 earrings
 drop
 crystal
 bead
 silver
 pink
Cluster 1:
 heart
 metal
 sign
 decoration
 love
 hanging
 large
 small
 pink
 wicker
Cluster 2:
 rose
 12
 lights
 set
 english
 pack
 tissues
 candle
 pink
 danish
Cluster 3:
 red
 bag
 retrospot
 jumbo
 vintage
 design
 lunch
 set
 charm
 charlotte
Cluster 4:
 wall
 art
 clock
 mirrored
 stitched
 tidy
 diner
 heart
 organiser
 kitchen
Cluster 5:
 set
 pink
 box
 vintage
 nan
 design
 cake
 white
 card
 flower
Cluster 6:
 blue
 polkadot
 french
 sign
 metal
 door
 flower
 paisley
 ceramic
 garden
Cluster 7:
 cover
 cushion
 woven
 food
 passport
 pink
 french
 union
 rose
 crochet
Cluster 8:
 christmas
 tree
 vintage
 50
 decoration
 star
 heart
 set
 10
 ribbons
Cluster 9:
 holder
 light
 hanging
 heart
 glass
 zinc
 bird
 star
 silver
 candle
In [133]:
def show_recommendations(product):
    #print("Cluster ID:")
    Y = vectorizer.transform([product])
    prediction = model.predict(Y)
    #print(prediction)
    print_cluster(prediction[0])
In [134]:
show_recommendations('flower')
Cluster 5:
 set
 pink
 box
 vintage
 nan
 design
 cake
 white
 card
 flower

7. LightFM

In [135]:
from lightfm import LightFM
from lightfm.evaluation import precision_at_k, recall_at_k, auc_score
In [136]:
counts = df_limited['CustomerID'].value_counts()
item_counts = df_limited['StockCode'].value_counts()
In [137]:
# keep only customers and items that appear at least twice
data = df_limited[~df_limited['CustomerID'].isin(counts[counts < 2].index)]
data = data[~data['StockCode'].isin(item_counts[item_counts < 2].index)]
In [138]:
data.StockCode.nunique()
Out[138]:
2851
In [139]:
# Time-based train-test split: first 45 weeks for training, next 5 weeks for testing
start_train = data['InvoiceDate'].min()
start_test = start_train + pd.to_timedelta(45, unit='w')
end_test = start_test + pd.to_timedelta(5, unit='w')

# Create new limited df
data = data.loc[(data['InvoiceDate'] > start_train) & (data['InvoiceDate'] <= end_test)]

# Create train_split flag
data['train_split'] = (data['InvoiceDate'] <= start_test).astype(int)
print("Proportion of train events: {:.2f}".format(data['train_split'].mean()))
Proportion of train events: 0.82
In [140]:
items = data[['StockCode','Description']]
data_train = data[data['train_split']==1]
data_test = data[data['train_split']==0]
In [141]:
# the Categoricals data structure consists of a categories array and an integer array of codes which point to 
#    the real value in the categories array
user_cat = data['CustomerID'].astype('category')
item_cat = data['StockCode'].astype("category")

# create a sparse matrix of all the item/user/counts triples for the train set and test set
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix
item_user_train = coo_matrix((data['train_split'],
                              (item_cat.cat.codes,
                               user_cat.cat.codes))).tocsr()
item_user_train.eliminate_zeros()  # remove zero entries
# produce transpose of item_user_train
user_item_train = item_user_train.T

item_user_test = coo_matrix(((~data['train_split'].astype(bool)).astype(int),
                             (item_cat.cat.codes,
                              user_cat.cat.codes))).tocsr()
item_user_test.eliminate_zeros()  # remove zero entries
# produce transpose of item_user_test
user_item_test = item_user_test.T

# map each item and user category to a unique numeric code
user_map = dict(zip(user_cat, user_cat.cat.codes))
item_map = dict(zip(item_cat, item_cat.cat.codes))

def get_keys(value, dictionary):
    """Function to get the dictionary key with a specified value"""
    return list(dictionary.keys())[list(dictionary.values()).index(value)]

# confirm shapes
print(f"train set shape: {item_user_train.shape} and test set shape: {item_user_test.shape}")

# check sparsity
pzeros_train = 100 * (1 - item_user_train.count_nonzero() / (item_user_train.shape[0] * item_user_train.shape[1]))
pzeros_test = 100 * (1 - item_user_test.count_nonzero() / (item_user_test.shape[0] * item_user_test.shape[1]))
print(f"train set percentage of zeros: {pzeros_train} and test set percentage of zeros: {pzeros_test}")
train set shape: (2851, 3605) and test set shape: (2851, 3605)
train set percentage of zeros: 98.09517647407947 and test set percentage of zeros: 99.46885804479632
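The construction above (pandas categorical codes feeding a `coo_matrix`) can be illustrated on a tiny made-up interaction table:

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# made-up toy purchases: one row per (customer, item) event
toy = pd.DataFrame({
    'CustomerID': [12347, 12347, 12348, 12350],
    'StockCode': ['10002', '10080', '10002', '10120'],
})

# category codes give each customer/item a dense integer index
user_codes = toy['CustomerID'].astype('category').cat.codes
item_codes = toy['StockCode'].astype('category').cat.codes

# rows = items, columns = users, values = event counts
mat = coo_matrix((np.ones(len(toy), dtype=int),
                  (item_codes, user_codes))).tocsr()

print(mat.shape)      # → (3, 3)
print(mat.toarray())
```

With three distinct customers and three distinct items, the result is a 3×3 item-user matrix with a 1 wherever that customer bought that item, exactly as in the larger train/test matrices above.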
In [142]:
ratings_utility_matrix_train = data_train.pivot_table(values='Quantity', index='CustomerID', columns='StockCode', fill_value=0)
ratings_utility_matrix_train.head()
Out[142]:
StockCode 10002 10080 10120 10125 10133 10135 11001 15030 15034 15036 ... 90201A 90201B 90201C 90209B 90209C 90214A 90214K BANK CHARGES C2 M
CustomerID
12347.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0.0 0.0 ... 0 0 0 0.0 0.0 0.0 0.0 0 0 0.0
12348.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0.0 0.0 ... 0 0 0 0.0 0.0 0.0 0.0 0 0 0.0
12350.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0.0 0.0 ... 0 0 0 0.0 0.0 0.0 0.0 0 0 0.0
12352.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0.0 0.0 ... 0 0 0 0.0 0.0 0.0 0.0 0 0 1.0
12354.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0.0 0.0 ... 0 0 0 0.0 0.0 0.0 0.0 0 0 0.0

5 rows × 2792 columns

In [143]:
# interactions matrix
interactions = ratings_utility_matrix_train
In [144]:
def create_user_dict(interactions):
    '''
    Function to create a user dictionary mapping each user to its row index in the interaction dataset
    Required Input - 
        interactions - dataset created by create_interaction_matrix
    Expected Output -
        user_dict - dictionary with user_id as key and its interaction-matrix row index as value
    '''
    user_id = list(interactions.index)
    user_dict = {}
    counter = 0 
    for i in user_id:
        user_dict[i] = counter
        counter += 1
    return user_dict
    
def create_item_dict(df,id_col,name_col):
    '''
    Function to create an item dictionary mapping item_id to item name
    Required Input - 
        - df = Pandas dataframe with Item information
        - id_col = Column name containing unique identifier for an item
        - name_col = Column name containing name of the item
    Expected Output -
        item_dict = Dictionary type output containing item_id as key and item_name as value
    '''
    item_dict ={}
    for i in range(df.shape[0]):
        item_dict[(df.loc[i,id_col])] = df.loc[i,name_col]
    return item_dict

def runMF(interactions, n_components=30, loss='warp', k=15, epoch=30,n_jobs = 4):
    '''
    Function to run matrix-factorization algorithm
    Required Input -
        - interactions = dataset created by create_interaction_matrix
        - n_components = number of embedding dimensions used to represent items and users
        - loss = loss function; other options are 'logistic' and 'bpr'
        - epoch = number of epochs to run 
        - n_jobs = number of cores used for execution 
    Expected Output  -
        Model - Trained model
    '''
    x = sparse.csr_matrix(interactions.values)
    model = LightFM(no_components= n_components, loss=loss,k=k)
    model.fit(x,epochs=epoch,num_threads = n_jobs)
    return model

def sample_recommendation_user(model, interactions, user_id, user_dict, 
                               item_dict,threshold = 0,nrec_items = 10, show = True):
    '''
    Function to produce user recommendations
    Required Input - 
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
        - user_id = user ID for which we need to generate recommendation
        - user_dict = dictionary with user_id as key and interaction-matrix row index as value
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - threshold = value above which a rating counts as favorable in the interaction matrix
        - nrec_items = number of recommendations to return
    Expected Output - 
        - Prints the list of items the given user has already bought
        - Prints the list of N recommended items the user is likely to be interested in
    '''
    n_users, n_items = interactions.shape
    user_x = user_dict[user_id]
    scores = pd.Series(model.predict(user_x,np.arange(n_items)))
    scores.index = interactions.columns
    scores = list(pd.Series(scores.sort_values(ascending=False).index))
    
    known_items = list(pd.Series(interactions.loc[user_id,:] \
                                 [interactions.loc[user_id,:] > threshold].index) \
                       .sort_values(ascending=False))
    
    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    known_items = list(pd.Series(known_items).apply(lambda x: item_dict[x]))
    scores = list(pd.Series(return_score_list).apply(lambda x: item_dict[x]))
    if show == True:
        print("Known Likes:")
        counter = 1
        for i in known_items:
            print(str(counter) + '- ' + i)
            counter+=1

        print("\n Recommended Items:")
        counter = 1
        for i in scores:
            print(str(counter) + '- ' + i)
            counter+=1
    return return_score_list
    

def sample_recommendation_item(model,interactions,item_id,user_dict,item_dict,number_of_user):
    '''
    Function to produce a list of the top N most interested users for a given item
    Required Input -
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
        - item_id = item ID for which we need to generate recommended users
        - user_dict = dictionary with user_id as key and interaction-matrix row index as value
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - number_of_user = Number of users needed as an output
    Expected Output -
        - user_list = List of recommended users 
    '''
    n_users, n_items = interactions.shape
    x = np.array(interactions.columns)
    scores = pd.Series(model.predict(np.arange(n_users), np.repeat(x.searchsorted(item_id),n_users)))
    user_list = list(interactions.index[scores.sort_values(ascending=False).head(number_of_user).index])
    return user_list 


def create_item_emdedding_distance_matrix(model,interactions):
    '''
    Function to create the item-item embedding similarity matrix
    Required Input -
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
    Expected Output -
        - item_emdedding_distance_matrix = Pandas dataframe of pairwise cosine similarities between items (higher = more similar)
    '''
    df_item_norm_sparse = sparse.csr_matrix(model.item_embeddings)
    similarities = cosine_similarity(df_item_norm_sparse)
    item_emdedding_distance_matrix = pd.DataFrame(similarities)
    item_emdedding_distance_matrix.columns = interactions.columns
    item_emdedding_distance_matrix.index = interactions.columns
    return item_emdedding_distance_matrix

def item_item_recommendation(item_emdedding_distance_matrix, item_id, 
                             item_dict, n_items = 10, show = True):
    '''
    Function to create item-item recommendation
    Required Input - 
        - item_emdedding_distance_matrix = Pandas dataframe of pairwise cosine similarities between items
        - item_id  = item ID for which we need to generate recommended items
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - n_items = Number of items needed as an output
    Expected Output -
        - recommended_items = List of recommended items
    '''
    recommended_items = list(pd.Series(item_emdedding_distance_matrix.loc[item_id,:]. \
                                  sort_values(ascending = False).head(n_items+1). \
                                  index[1:n_items+1]))
    if show == True:
        print("Item of interest :{0}".format(item_dict[item_id]))
        print("Item similar to the above item:")
        counter = 1
        for i in recommended_items:
            print(str(counter) + '- ' +  item_dict[i])
            counter+=1
    return recommended_items
In [145]:
#create user dict
user_dict = create_user_dict(interactions)
In [146]:
items = items.reset_index().drop(columns=['index'])
In [147]:
#create item dict
item_dict = create_item_dict(df = items,
                               id_col = 'StockCode',
                               name_col = 'Description')
In [148]:
# building matrix factorization model
mf_model = runMF(interactions = interactions,
                 n_components = 30,
                 loss = 'warp',
                 k = 15,
                 epoch = 30,
                 n_jobs = 4)
In [149]:
# user recommender
rec_list = sample_recommendation_user(model = mf_model, 
                                      interactions = interactions, 
                                      user_id = 12347.0, 
                                      user_dict = user_dict,
                                      item_dict = item_dict, 
                                      threshold = 4,
                                      nrec_items = 10)
Known Likes:
1- VICTORIAN SEWING KIT
2- BLACK CANDELABRA T-LIGHT HOLDER
3- CHILDRENS CUTLERY POLKADOT PINK
4- CHILDRENS CUTLERY POLKADOT BLUE
5- CHILDRENS CUTLERY RETROSPOT RED 
6- 72 SWEETHEART FAIRY CAKE CASES
7- 60 TEATIME FAIRY CAKE CASES
8- BOX OF 6 ASSORTED COLOUR TEASPOONS
9- BLUE NEW BAROQUE CANDLESTICK CANDLE
10- PINK NEW BAROQUECANDLESTICK CANDLE
11- 3D SHEET OF CAT STICKERS
12- 3D SHEET OF DOG STICKERS
13- 3D DOG PICTURE PLAYING CARDS
14- COLOURED GLASS STAR T-LIGHT HOLDER
15- FEATHER PEN,COAL BLACK
16- TEA TIME OVEN GLOVE
17- RED REFECTORY CLOCK 
18- SET OF 60 VINTAGE LEAF CAKE CASES 
19- SET 40 HEART SHAPE PETIT FOUR CASES
20- TREASURE ISLAND BOOK BOX
21- REGENCY TEA PLATE PINK
22- REGENCY TEA PLATE GREEN 
23- REGENCY TEA PLATE ROSES 
24- REGENCY TEA STRAINER
25- SINGLE ANTIQUE ROSE HOOK IVORY
26- RABBIT NIGHT LIGHT
27- ICE CREAM SUNDAE LIP GLOSS
28- REVOLVER WOODEN RULER 
29- GIFT BAG PSYCHEDELIC APPLES
30- BLUE DRAWER KNOB ACRYLIC EDWARDIAN
31- PURPLE DRAWERKNOB ACRYLIC EDWARDIAN
32- RED DRAWER KNOB ACRYLIC EDWARDIAN
33- GREEN DRAWER KNOB ACRYLIC EDWARDIAN
34- PINK DRAWER KNOB ACRYLIC EDWARDIAN
35- CLEAR DRAWER KNOB ACRYLIC EDWARDIAN
36- ALARM CLOCK BAKELIKE RED 
37- ROSES REGENCY TEACUP AND SAUCER 
38- HOLIDAY FUN LUDO
39- EMERGENCY FIRST AID TIN 
40- MINI PAINT SET VINTAGE 
41- WATERING CAN PINK BUNNY
42- TOOTHPASTE TUBE PEN
43- PACK OF 60 SPACEBOY CAKE CASES
44- AIRLINE BAG VINTAGE TOKYO 78
45- FOUR HOOK  WHITE LOVEBIRDS
46- SMALL HEART MEASURING SPOONS
47- LARGE HEART MEASURING SPOONS
48- MINI LADLE LOVE HEART RED 
49- PACK OF 60 MUSHROOM CAKE CASES
50- PACK OF 60 DINOSAUR CAKE CASES
51- CHOCOLATE CALCULATOR
52- VINTAGE HEADS AND TAILS CARD GAME 
53- RED TOADSTOOL LED NIGHT LIGHT
54- WOODLAND DESIGN  COTTON TOTE BAG
55- BATHROOM METAL SIGN 
56- RED RETROSPOT OVEN GLOVE 
57- BOOM BOX SPEAKER BOYS
58- RED RETROSPOT OVEN GLOVE DOUBLE
59- SET/2 RED RETROSPOT TEA TOWELS 
60- SANDWICH BATH SPONGE
61- CAMOUFLAGE EAR MUFF HEADPHONES
62- BLACK EAR MUFF HEADPHONES
63- WOODLAND CHARLOTTE BAG
64- RED RETROSPOT PURSE 
65- NAMASTE SWAGAT INCENSE
66- SMALL FOLDING SCISSOR(POINTED EDGE)

 Recommended Items:
1- REGENCY CAKESTAND 3 TIER
2- POPCORN HOLDER
3- SET OF 5 PANCAKE DAY MAGNETS
4- CIRCUS PARADE LUNCH BOX 
5- GUMBALL COAT RACK
6- DOLLY GIRL LUNCH BOX
7- AIRLINE BAG VINTAGE JET SET RED
8- SET OF 4 JAM JAR MAGNETS
9- SPACEBOY LUNCH BOX 
10- REGENCY TEAPOT ROSES 
In [150]:
# item-user recommender
sample_recommendation_item(model = mf_model,
                           interactions = interactions,
                           item_id = '10002',
                           user_dict = user_dict,
                           item_dict = item_dict,
                           number_of_user = 15)
Out[150]:
[13483.0,
 17118.0,
 14624.0,
 13743.0,
 13509.0,
 16681.0,
 13657.0,
 14762.0,
 16333.0,
 17334.0,
 15651.0,
 13953.0,
 16682.0,
 17670.0,
 13470.0]
In [151]:
# item-item recommender
item_item_dist = create_item_emdedding_distance_matrix(model = mf_model,
                                                       interactions = interactions)
In [152]:
rec_list = item_item_recommendation(item_emdedding_distance_matrix = item_item_dist,
                                    item_id = '15060B',
                                    item_dict = item_dict,
                                    n_items = 10)
Item of interest :FAIRY CAKE DESIGN UMBRELLA
Item similar to the above item:
1- LAVENDER SCENT CAKE CANDLE
2- TROPICAL PASSPORT COVER 
3- CHRISTMAS PUDDING TRINKET POT 
4- PINK/WHITE CHRISTMAS TREE 60CM
5- BATHROOM SET LOVE HEART DESIGN
6- DIAMANTE HEART SHAPED WALL MIRROR, 
7- COLUMBIAN CANDLE ROUND
8- TEA TIME OVEN GLOVE
9- FENG SHUI PILLAR CANDLE
10- WHITE TEA,COFFEE,SUGAR JARS
In [157]:
def runMF2(interactions, n_components=30, loss='warp', k=15, epoch=30,n_jobs = 4):
    '''
    Function to run matrix-factorization algorithm
    Required Input -
        - interactions = dataset created by create_interaction_matrix
        - n_components = number of embedding dimensions used to represent items and users
        - loss = loss function; other options are 'logistic' and 'bpr'
        - epoch = number of epochs to run 
        - n_jobs = number of cores used for execution 
    Expected Output  -
        Model - Trained model
    '''
    model = LightFM(no_components= n_components, loss=loss,k=k)
    model.fit(interactions,epochs=epoch,num_threads = n_jobs)
    return model
In [158]:
# building matrix factorization model
mf_model2 = runMF2(interactions = item_user_train,
                 n_components = 30,
                 loss = 'warp',
                 k = 15,
                 epoch = 30,
                 n_jobs = 4)
In [169]:
#create user dict
user_dict = create_user_dict(interactions)
In [170]:
items = items.reset_index().drop(columns=['index'])
In [171]:
#create item dict
item_dict = create_item_dict(df = items,
                               id_col = 'StockCode',
                               name_col = 'Description')
In [172]:
# user recommender
rec_list = sample_recommendation_user(model = mf_model2, 
                                      interactions = interactions, 
                                      user_id = 12347.0, 
                                      user_dict = user_dict,
                                      item_dict = item_dict, 
                                      threshold = 4,
                                      nrec_items = 10)
Known Likes:
1- VICTORIAN SEWING KIT
2- BLACK CANDELABRA T-LIGHT HOLDER
3- CHILDRENS CUTLERY POLKADOT PINK
4- CHILDRENS CUTLERY POLKADOT BLUE
5- CHILDRENS CUTLERY RETROSPOT RED 
6- 72 SWEETHEART FAIRY CAKE CASES
7- 60 TEATIME FAIRY CAKE CASES
8- BOX OF 6 ASSORTED COLOUR TEASPOONS
9- BLUE NEW BAROQUE CANDLESTICK CANDLE
10- PINK NEW BAROQUECANDLESTICK CANDLE
11- 3D SHEET OF CAT STICKERS
12- 3D SHEET OF DOG STICKERS
13- 3D DOG PICTURE PLAYING CARDS
14- COLOURED GLASS STAR T-LIGHT HOLDER
15- FEATHER PEN,COAL BLACK
16- TEA TIME OVEN GLOVE
17- RED REFECTORY CLOCK 
18- SET OF 60 VINTAGE LEAF CAKE CASES 
19- SET 40 HEART SHAPE PETIT FOUR CASES
20- TREASURE ISLAND BOOK BOX
21- REGENCY TEA PLATE PINK
22- REGENCY TEA PLATE GREEN 
23- REGENCY TEA PLATE ROSES 
24- REGENCY TEA STRAINER
25- SINGLE ANTIQUE ROSE HOOK IVORY
26- RABBIT NIGHT LIGHT
27- ICE CREAM SUNDAE LIP GLOSS
28- REVOLVER WOODEN RULER 
29- GIFT BAG PSYCHEDELIC APPLES
30- BLUE DRAWER KNOB ACRYLIC EDWARDIAN
31- PURPLE DRAWERKNOB ACRYLIC EDWARDIAN
32- RED DRAWER KNOB ACRYLIC EDWARDIAN
33- GREEN DRAWER KNOB ACRYLIC EDWARDIAN
34- PINK DRAWER KNOB ACRYLIC EDWARDIAN
35- CLEAR DRAWER KNOB ACRYLIC EDWARDIAN
36- ALARM CLOCK BAKELIKE RED 
37- ROSES REGENCY TEACUP AND SAUCER 
38- HOLIDAY FUN LUDO
39- EMERGENCY FIRST AID TIN 
40- MINI PAINT SET VINTAGE 
41- WATERING CAN PINK BUNNY
42- TOOTHPASTE TUBE PEN
43- PACK OF 60 SPACEBOY CAKE CASES
44- AIRLINE BAG VINTAGE TOKYO 78
45- FOUR HOOK  WHITE LOVEBIRDS
46- SMALL HEART MEASURING SPOONS
47- LARGE HEART MEASURING SPOONS
48- MINI LADLE LOVE HEART RED 
49- PACK OF 60 MUSHROOM CAKE CASES
50- PACK OF 60 DINOSAUR CAKE CASES
51- CHOCOLATE CALCULATOR
52- VINTAGE HEADS AND TAILS CARD GAME 
53- RED TOADSTOOL LED NIGHT LIGHT
54- WOODLAND DESIGN  COTTON TOTE BAG
55- BATHROOM METAL SIGN 
56- RED RETROSPOT OVEN GLOVE 
57- BOOM BOX SPEAKER BOYS
58- RED RETROSPOT OVEN GLOVE DOUBLE
59- SET/2 RED RETROSPOT TEA TOWELS 
60- SANDWICH BATH SPONGE
61- CAMOUFLAGE EAR MUFF HEADPHONES
62- BLACK EAR MUFF HEADPHONES
63- WOODLAND CHARLOTTE BAG
64- RED RETROSPOT PURSE 
65- NAMASTE SWAGAT INCENSE
66- SMALL FOLDING SCISSOR(POINTED EDGE)

 Recommended Items:
1- RED RETROSPOT WASHBAG
2- GARDENERS KNEELING PAD CUP OF TEA 
3- CREAM WALL PLANTER HEART SHAPED
4- TEA BAG PLATE RED RETROSPOT
5- PEG BAG APPLES DESIGN
6- BLACK ENCHANTED FOREST PLACEMAT
7- RETROSPOT WOODEN HEART DECORATION
8- SET OF 20 KIDS COOKIE CUTTERS
9- SMALL WHITE HEART OF WICKER
10- AIRLINE BAG VINTAGE WORLD CHAMPION 
In [177]:
rec_list = item_item_recommendation(item_emdedding_distance_matrix = item_item_dist,
                                    item_id = '15060B',
                                    item_dict = item_dict,
                                    n_items = 10)
Item of interest :FAIRY CAKE DESIGN UMBRELLA
Item similar to the above item:
1- LAVENDER SCENT CAKE CANDLE
2- TROPICAL PASSPORT COVER 
3- CHRISTMAS PUDDING TRINKET POT 
4- PINK/WHITE CHRISTMAS TREE 60CM
5- BATHROOM SET LOVE HEART DESIGN
6- DIAMANTE HEART SHAPED WALL MIRROR, 
7- COLUMBIAN CANDLE ROUND
8- TEA TIME OVEN GLOVE
9- FENG SHUI PILLAR CANDLE
10- WHITE TEA,COFFEE,SUGAR JARS
In [178]:
# Evaluate the trained model
In [179]:
# evaluate on the held-out test interactions (train interactions are not excluded from ranking)
prec = precision_at_k(mf_model2, item_user_test, train_interactions=None, k=10).mean()
rec = recall_at_k(mf_model2, item_user_test, train_interactions=None, k=10).mean()
auc = auc_score(mf_model2, item_user_test, train_interactions=None).mean()
In [180]:
print(prec)
print(rec)
print(auc)
0.10953331
0.07986290339603218
0.653987
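As a sanity check on what these numbers mean: precision@k is the share of the top-k recommendations that the user actually bought in the test window, and recall@k is the share of the user's test items that made it into the top k. A hand-worked toy example for a single user (all identifiers and lists are made up):

```python
# made-up top-10 recommendations vs. the items one user bought in the test window
recommended = ['22423', '84879', '85099B', '47566', '20725',
               '23203', '22386', '21212', '22960', '22720']
test_items = {'84879', '20725', '21212', '22111'}

hits = [item for item in recommended if item in test_items]
precision_at_10 = len(hits) / len(recommended)   # 3 / 10 = 0.3
recall_at_10 = len(hits) / len(test_items)       # 3 / 4  = 0.75

print(precision_at_10, recall_at_10)  # → 0.3 0.75
```

LightFM's `precision_at_k` and `recall_at_k` compute exactly this per user and the `.mean()` above averages across users, which is why precision tends to be low when users buy only a handful of items in the 5-week test window.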